Exploratory analysis of semantic categories: comparing data-driven and human similarity judgments

Tiina Lindh-Knuutila, Timo Honkela

doi:10.1186/s40469-015-0001-1

Open Access

https://doi.org/10.1186/s40469-015-0001-1

Abstract

In this article, automatically generated and manually crafted semantic representations are compared. The comparison takes place under the assumption that neither of these has a primary status over the other. While linguistic resources can be used to evaluate the results of automated processes, data-driven methods are useful in assessing the quality or improving the coverage of hand-created semantic resources. We apply two unsupervised learning methods, Independent Component Analysis (ICA) and a word-level probabilistic topic model based on Latent Dirichlet Allocation (LDA), to create semantic representations from a large text corpus. We then compare the obtained results to two semantically labeled dictionaries. In addition, we use the Self-Organizing Map to visualize the obtained representations. We show that both methods find a considerable amount of category information in an unsupervised way. Rather than only finding groups of similar words, they can automatically find a number of features that characterize words. The unsupervised methods are also used in exploration, where they provide findings that go beyond the manually predefined label sets. In addition, we demonstrate how the Self-Organizing Map visualization can be used in exploration and further analysis. This article compares unsupervised learning methods and semantically labeled dictionaries. We show that these methods are able to find categorical information and that they can also be used in exploratory analysis. In general, information-theoretically motivated and probabilistic methods provide results that are at a comparable level. Moreover, the automatic methods and human classifications provide complementary access to semantic categorization. Data-driven methods can furthermore be cost-effective and can adapt to a particular domain through an appropriate choice of data sets.
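The comparison between learned features and a labeled dictionary can be illustrated with a small sketch. It is not the authors' evaluation procedure: the activation values, the word list, the labels, and the scoring rule (how many of a category's members appear among the top-k words of a component) are all invented here for illustration, and the helper names `top_words` and `best_component_per_category` are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: each word has an activation on 3 learned components
# (standing in for ICA components or LDA topic weights). Values are invented.
activations = {
    "dog":    [0.9, 0.1, 0.0],
    "cat":    [0.8, 0.2, 0.1],
    "horse":  [0.7, 0.0, 0.2],
    "hammer": [0.1, 0.9, 0.1],
    "saw":    [0.0, 0.8, 0.2],
    "drill":  [0.2, 0.7, 0.0],
    "apple":  [0.1, 0.1, 0.9],
    "pear":   [0.0, 0.2, 0.8],
}

# Hypothetical partial labeling: "pear" is deliberately left unlabeled,
# mirroring the fact that a labeled dictionary covers only part of a vocabulary.
labels = {
    "dog": "animal", "cat": "animal", "horse": "animal",
    "hammer": "tool", "saw": "tool", "drill": "tool",
    "apple": "fruit",
}


def top_words(component, k):
    """Return the k words with the largest activation on one component."""
    ranked = sorted(activations, key=lambda w: activations[w][component], reverse=True)
    return ranked[:k]


def best_component_per_category(k=3):
    """For each category, pick the component whose top-k words hold most of its members."""
    n_components = len(next(iter(activations.values())))
    category_sizes = Counter(labels.values())
    result = {}
    for category, size in category_sizes.items():
        best = None
        for c in range(n_components):
            hits = sum(1 for w in top_words(c, k) if labels.get(w) == category)
            score = hits / min(k, size)  # fraction of the achievable hits
            if best is None or score > best[1]:
                best = (c, score)
        result[category] = best
    return result


matches = best_component_per_category(k=3)
for category, (component, score) in sorted(matches.items()):
    print(f"{category}: component {component}, score {score:.2f}")
```

In this toy setup the unlabeled word "pear" ranks high on the same component as "apple", which is the kind of finding beyond the predefined label set that an exploratory analysis could surface.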

Highlights

In this article, automatically generated and manually crafted semantic representations are compared
In the first experiment, the Independent Component Analysis (ICA) and Latent Dirichlet Allocation (LDA) models were trained with the vector representations that correspond to the Battig vocabulary
Considering that the ICA method attempts to describe any kind of structure in the data, and the Battig vocabulary covers only 0.2% and the BLESS vocabulary only 0.8% of the vocabulary of 200 000 words, these results show that even partial labelings can be very useful when studying such a large dataset
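To give a sense of scale for the coverage figures above, the stated percentages can be converted into approximate word counts. This is a back-of-the-envelope sketch: the percentages and the vocabulary size come from the text, the resulting counts are only as precise as those rounded percentages, and the function name `approx_labeled_words` is hypothetical.

```python
TOTAL_VOCABULARY = 200_000  # vocabulary size stated in the text

# Coverage percentages stated in the text (rounded figures).
coverage_percent = {"Battig": 0.2, "BLESS": 0.8}


def approx_labeled_words(percent, total=TOTAL_VOCABULARY):
    """Approximate number of labeled words implied by a coverage percentage."""
    return round(total * percent / 100)


for name, pct in coverage_percent.items():
    print(f"{name}: about {approx_labeled_words(pct)} labeled words")
```

Under these figures, the labeled sets amount to roughly 400 and 1,600 words out of 200 000.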

Summary

Introduction

Automatically generated and manually crafted semantic representations are compared. While linguistic resources can be used to evaluate the results of automated processes, data-driven methods are useful in assessing the quality or improving the coverage of hand-created semantic resources. We explore the relationship between human and data-driven semantic similarity judgments. We aim to see a) whether the representations that are automatically generated in a data-driven manner coincide.

Challenge of semantics

Semantics is an intriguing and challenging area of linguistics. Linguists and researchers in nearby disciplines have created a number of theories related to semantics. These theories have been used as frameworks for semantic description or for labeling lexica and corpora (Cruse 1986).

Journal: Computational Cognitive Science	Publication Date: Jul 7, 2015
Citations: 38	License type: CC BY 2.0
