Study of Statistical Text Representation Methods for Performance Improvement of a Hierarchical Attention Network

Adam Wawrzyński, Julian Szymański

doi:10.3390/app11136113

Abstract

To effectively process textual data, many approaches have been proposed to create text representations. Transforming text into a numerical form that computers can process is crucial for downstream tasks such as document classification and document summarization. In our work, we study the quality of text representations produced by statistical methods and compare them to approaches based on neural networks. We describe in detail nine different algorithms used for text representation and then evaluate them on five diverse datasets: BBCSport, BBC, Ohsumed, 20Newsgroups, and Reuters. The selected statistical models include Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TFIDF) weighting, Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). For the second group, deep neural networks, we selected Partition-Smooth Inverse Frequency (P-SIF), Doc2Vec-Distributed Bag of Words Paragraph Vector (Doc2Vec-DBoW), Doc2Vec-Memory Model of Paragraph Vectors (Doc2Vec-DM), Hierarchical Attention Network (HAN) and Longformer. The text representation methods were benchmarked on the document classification task, with the BoW and TFIDF models used as baselines. Based on the identified weaknesses of the HAN method, we propose an improvement in the form of a Hierarchical Weighted Attention Network (HWAN). Incorporating statistical features into HAN latent representations improves the results or keeps them comparable on four out of five datasets. The article also shows how the length of the processed text affects the results of HAN and the HWAN variants.
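
As a rough illustration of the two baseline representations named in the abstract, the following Python sketch builds BoW count vectors and reweights them with one common TF-IDF variant (raw term count multiplied by log(N/df)). The toy documents, the whitespace tokenization, and the function names `bag_of_words` and `tfidf` are assumptions made for illustration; the abstract does not state which preprocessing or weighting variant the authors used.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for lowercased, whitespace-split documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0] * len(vocab)
        for tok, n in Counter(toks).items():
            vec[index[tok]] = n
        vectors.append(vec)
    return vocab, vectors


def tfidf(count_vectors):
    """Reweight raw counts by idf = log(N / df), an assumed TF-IDF variant."""
    if not count_vectors:
        return []
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # Document frequency: in how many documents each term occurs at least once.
    df = [sum(1 for vec in count_vectors if vec[j] > 0) for j in range(n_terms)]
    # Every vocabulary term comes from some document, so df is never zero here.
    idf = [math.log(n_docs / d) for d in df]
    return [[c * w for c, w in zip(vec, idf)] for vec in count_vectors]


# Hypothetical toy corpus, not taken from any of the benchmark datasets.
docs = [
    "the match ended in a draw",
    "the market ended lower",
    "the striker scored in the match",
]
vocab, counts = bag_of_words(docs)
weights = tfidf(counts)
```

In this variant a term that occurs in every document (here "the") receives zero weight, while a term unique to one document receives the largest weight. Either set of vectors could then be passed to any standard classifier to form a baseline of the kind the abstract describes.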

Highlights

The text representation methods were benchmarked on the document classification task, with the Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TFIDF) models used as baselines
To overcome the weaknesses of the hierarchical attention network and to enrich them with statistical features, we propose the Hierarchical Weighted Attention Network (HWAN)
We present the results of the experimental comparison of selected statistical models and neural networks for the document classification task

Summary

Introduction

Language is a natural way of exchanging information between people. It is a fast and convenient way of communicating, which explains the popularity of instant messengers used by millions of users every day. It is also used to store and exchange business, government, medical and research data in the form of text documents. Because language is used so universally, and especially because of the Internet, humankind generates a vast amount of text data every day.

Journal: Applied Sciences	Publication Date: Jun 30, 2021
Citations: 2	License type: CC BY 4.0
