Categorization of Unorganized Text Corpora for better Domain-Specific Language Modeling

Jan Stas, Daniel Hladek, Daniel Zlacky, Jozef Juhar

doi:10.15598/aeee.v11i5.897

Open Access

https://doi.org/10.15598/aeee.v11i5.897

Abstract

This paper describes the categorization of unorganized text data gathered from the Internet into in-domain and out-of-domain data for better domain-specific language modeling and speech recognition. An algorithm for text categorization and topic detection based on the most frequent key phrases is presented. In this scheme, each document entering the categorization process is represented by a vector space model with term weighting based on term frequency and inverse document frequency. Text documents are then automatically classified as in-domain or out-of-domain data by comparing them with the list of key phrases, using one of the selected distance/similarity measures and a predefined threshold. Experimental results of language modeling and adaptation to the judicial domain show a significant improvement in model perplexity of about 19 % and a decrease in the word error rate of the Slovak transcription and dictation system of about 5.54 %, in relative terms.
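The scheme summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the choice of cosine similarity as the similarity measure, and the names `tf_idf`, `cosine` and `classify` are all assumptions made for illustration, and the threshold value is arbitrary.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()


def idf_table(tokenized_docs):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    # The smoothing is an assumption; the abstract only names TF and IDF.
    n_docs = len(tokenized_docs)
    df = Counter(term for tokens in tokenized_docs for term in set(tokens))
    default = math.log(1 + n_docs) + 1  # weight for terms unseen in the documents
    return {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}, default


def tf_idf(tokens, idf, default_idf):
    # Sparse TF-IDF vector stored as {term: weight}.
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf.get(t, default_idf) for t, c in counts.items()}


def cosine(a, b):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(docs, key_phrases, threshold):
    # Label each document by its similarity to the key-phrase list.
    tokenized = [tokenize(d) for d in docs]
    idf, default_idf = idf_table(tokenized)
    key_vec = tf_idf(tokenize(" ".join(key_phrases)), idf, default_idf)
    results = []
    for tokens in tokenized:
        sim = cosine(tf_idf(tokens, idf, default_idf), key_vec)
        label = "in-domain" if sim >= threshold else "out-of-domain"
        results.append((label, sim))
    return results


if __name__ == "__main__":
    docs = [
        "the court issued a judgment in the appeal",
        "the football match ended in a draw",
        "the judge dismissed the appeal of the defendant",
    ]
    key_phrases = ["court judgment", "appeal", "judge", "defendant"]  # hypothetical list
    for doc, (label, sim) in zip(docs, classify(docs, key_phrases, threshold=0.3)):
        print(f"{label:14s} {sim:.3f}  {doc}")
```

In this toy setting the key-phrase list is treated as one short pseudo-document, so the same weighting applies to both sides of the comparison. Other distance/similarity measures could be swapped in for `cosine` without changing the rest of the sketch.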

Highlights

One of the key problems of the text data gathered from the Internet is their thematic heterogeneity
In domain-specific speech recognition and statistical language modeling, such unorganized text data introduce many ambiguities into language model training by overestimating the probabilities of n-grams that are typically unrelated to the domain in which speech recognition is performed
Contemporary text categorization is usually based either on topic detection with keyword identification, which assigns text data to predefined domains [3], or on text document clustering, which measures the similarity between two or more documents [4], [5] using iterative or hierarchical clustering algorithms [6]

Summary

Introduction

One of the key problems of the text data gathered from the Internet is their thematic heterogeneity. Contemporary text categorization is usually based either on topic detection with keyword identification, which assigns text data to predefined domains [3], or on text document clustering, which measures the similarity between two or more documents [4], [5] using iterative or hierarchical clustering algorithms [6]. Based on this knowledge, we propose an algorithm for text categorization that classifies short segments (blocks of text or paragraphs) from unorganized text corpora as in-domain or out-of-domain data.
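The segment-level view described above can be sketched as follows. This is only an illustration under stated assumptions: segments are taken to be paragraphs separated by blank lines, and the scorer `key_phrase_hit_rate` is a hypothetical stand-in for whatever similarity measure is actually used. The function names and the threshold are likewise illustrative and not taken from the paper.

```python
import re


def split_into_segments(corpus_text):
    # Assumption: one or more blank lines separate paragraph-like segments.
    parts = re.split(r"\n\s*\n", corpus_text)
    # Collapse internal whitespace and drop empty pieces.
    return [" ".join(p.split()) for p in parts if p.strip()]


def key_phrase_hit_rate(key_phrases):
    # Toy scorer: fraction of key phrases that occur in the segment.
    phrases = [p.lower() for p in key_phrases]

    def score(segment):
        if not phrases:
            return 0.0
        text = segment.lower()
        return sum(p in text for p in phrases) / len(phrases)

    return score


def partition_segments(segments, score, threshold):
    # Route each segment to the in-domain or out-of-domain bucket.
    in_domain, out_of_domain = [], []
    for segment in segments:
        bucket = in_domain if score(segment) >= threshold else out_of_domain
        bucket.append(segment)
    return in_domain, out_of_domain


if __name__ == "__main__":
    corpus = (
        "The court\nissued a judgment.\n\n"
        "Football match ended.\n\n\n"
        "The appeal was dismissed."
    )
    scorer = key_phrase_hit_rate(["court", "judgment", "appeal"])  # hypothetical phrases
    kept, dropped = partition_segments(split_into_segments(corpus), scorer, threshold=0.3)
    print("in-domain:", kept)
    print("out-of-domain:", dropped)
```

Working at the paragraph level rather than on whole web pages lets a thematically mixed page contribute its relevant parts to the in-domain corpus while the rest is set aside.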

Text Corpora

Key Phrases Identification

Vector Space Model

Term Weighting

Automatic Thresholding

LVCSR Setup

Findings

Conclusion

Journal: Advances in Electrical and Electronic Engineering
Publication Date: Nov 19, 2013
Citations: 6
License type: cc-by
