Multilabel Text Classification in News Articles Using Long-Term Memory with Word2Vec

Winda Kurnia Sari, Reza Firsandaya Malik, Iman Saladin B Azhar, Dian Palupi Rini

doi:10.29207/resti.v4i2.1655


https://doi.org/10.29207/resti.v4i2.1655


Abstract


Multilabel text classification is the task of assigning text to one or more categories. As with other machine learning tasks, its performance is limited when labeled data is scarce, which makes semantic relationships difficult to capture. This study requires a multilabel text classification technique that can group news articles into four labels. Deep learning is the proposed approach to this problem. Deep learning methods used for text classification include Convolutional Neural Networks, Autoencoders, Deep Belief Networks, and Recurrent Neural Networks (RNN). RNN is one of the most popular architectures in natural language processing (NLP) because its recurrent structure is well suited to processing variable-length text. The deep learning method proposed in this study is an RNN with the Long Short-Term Memory (LSTM) architecture. The models are trained through trial-and-error experiments using LSTM with 300-dimensional word embedding features from Word2Vec. Eight proposed LSTM models are tuned and compared on a large-scale dataset to show that LSTM with Word2Vec features can achieve good performance in text classification. The results show that the fifth model obtains the highest accuracy, 95.38, with an average precision, recall, and F1-score of 95. In addition, LSTM with Word2Vec features produces plotted results that are close to a good fit in the seventh and eighth models.
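The abstract reports accuracy together with averaged precision, recall, and F1-score. The following is a minimal sketch of how such figures can be computed, not the authors' evaluation code. It assumes macro averaging (the source does not say which averaging was used) and uses hypothetical label names "A" and "B", since the four news labels are not named here.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_prf(y_true, y_pred, labels):
    """Macro-averaged (precision, recall, F1): compute each metric per label,
    then take the unweighted mean over labels. Macro averaging is an
    assumption; the source does not specify the averaging scheme."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # label p was predicted but is wrong
            fn[t] += 1  # true label t was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data with hypothetical labels, for illustration only.
y_true = ["A", "A", "B", "B"]
y_pred = ["A", "B", "B", "B"]
acc = accuracy(y_true, y_pred)
precision, recall, f1 = macro_prf(y_true, y_pred, ["A", "B"])
```

With four labels, the same function would be called with a four-element `labels` list; the per-label values can also be inspected before averaging to see which category is hardest.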

Highlights

Multilabel text classification is the task of categorizing text into one or more categories
Long Short-Term Memory (LSTM) with Word2Vec features produces plotted results that are close to a good fit in the seventh and eighth models

Summary

Research Method

Natural language processing has many applications, such as sentiment analysis, information retrieval, ranking, indexing, and document classification [1][2][3]. Feature extraction is an important part of machine learning, especially for text data. Mikolov introduced a better technique for extracting features from text [...]. This idea has been extended to compute embeddings [...]. This study uses 300 dimensions. [...] labels by proposing the Label Embedding Attentive Model (LEAM) to improve classification. By contrast, this study uses LSTM with a word embedding from deep learning, namely Word2Vec, and runs eight LSTM tuning experiments to obtain the optimal model for text classification.

The method in this study is an RNN that applies the LSTM architecture, which handles the exploding and vanishing gradient problems that can arise when training a traditional RNN [10]. RNN belongs to the deep learning category because the data is processed automatically, without feature definition. It uses an internal state (memory) to process input sequences, which makes it applicable to tasks such as natural language processing. Tanh is the activation function in the hidden layer and Softmax in the output layer.
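The pieces named in this summary (300-dimensional word vectors, an LSTM with Tanh in the hidden layer, and a Softmax output over four labels) can be sketched as a forward pass in plain Python. This is an illustration under assumptions, not the authors' model: the gate layout is the standard LSTM formulation, the hidden size of 8 is hypothetical, the weights are random and untrained, and random vectors stand in for real Word2Vec embeddings.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_lstm(input_dim, hidden_dim, num_labels, seed=0):
    """Random, untrained parameters for one LSTM layer plus an output layer."""
    rng = random.Random(seed)
    params = {}
    for gate in ("i", "f", "o", "g"):  # input, forget, output, candidate
        params["W_" + gate] = rand_matrix(hidden_dim, input_dim + hidden_dim, rng)
        params["b_" + gate] = [0.0] * hidden_dim
    params["W_out"] = rand_matrix(num_labels, hidden_dim, rng)
    params["b_out"] = [0.0] * num_labels
    return params


def lstm_step(params, x, h, c):
    """One time step: sigmoid gates, tanh candidate and tanh on the cell state."""
    z = x + h  # concatenate the current word vector with the previous hidden state

    def affine(gate):
        Wz = matvec(params["W_" + gate], z)
        return [a + b for a, b in zip(Wz, params["b_" + gate])]

    i = [sigmoid(v) for v in affine("i")]
    f = [sigmoid(v) for v in affine("f")]
    o = [sigmoid(v) for v in affine("o")]
    g = [math.tanh(v) for v in affine("g")]
    c_new = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
    h_new = [oi * math.tanh(ci) for oi, ci in zip(o, c_new)]
    return h_new, c_new


def classify(params, sequence, hidden_dim):
    """Run the LSTM over a sequence of word vectors; softmax over the labels."""
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    for x in sequence:
        h, c = lstm_step(params, x, h, c)
    logits = [a + b for a, b in zip(matvec(params["W_out"], h), params["b_out"])]
    return softmax(logits)


EMBED_DIM, HIDDEN_DIM, NUM_LABELS = 300, 8, 4  # hidden size is hypothetical
params = init_lstm(EMBED_DIM, HIDDEN_DIM, NUM_LABELS)
rng = random.Random(1)
# Five random vectors stand in for the Word2Vec embeddings of a five-word text.
sentence = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(5)]
probs = classify(params, sentence, HIDDEN_DIM)
```

Because the loop simply iterates over however many word vectors it receives, the same parameters handle texts of any length, which is the property the abstract cites as the reason RNNs suit variable-length text.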

Long Short-Term Memory

Training Model

Model 3

Model 2

Model 5

Model 7

Model 8

Conclusion