Performance of Machine Learning Algorithms on Automatic Summarization of Indonesian Language Texts
Automatic text summarization (ATS) has become an essential task for processing huge amounts of information efficiently. ATS has been extensively studied in resource-rich languages like English, but research on summarization for under-resourced languages, such as Bahasa Indonesia, is still limited. Indonesian presents unique linguistic challenges, including its agglutinative structure, borrowed vocabulary, and limited availability of high-quality training data. This study conducts a comparative evaluation of extractive, abstractive, and hybrid models for Indonesian text summarization, utilizing the IndoSum dataset which contains 20,000 text-summary pairs. We tested several models including LSA (Latent Semantic Analysis), LexRank, T5, and BART, to assess their effectiveness in generating summaries. The results show that the LexRank+BERT hybrid model outperforms traditional extractive methods, achieving better ROUGE precision, recall, and F-measure scores. Among the abstractive methods, the T5-Large model demonstrated the best performance, producing more coherent and semantically rich summaries compared to other models. These findings suggest that hybrid and abstractive approaches are better suited for Indonesian text summarization, especially when leveraging large-scale pre-trained language models.
- Research Article
- 10.1088/1757-899x/1098/3/032041
- Mar 1, 2021
- IOP Conference Series: Materials Science and Engineering
Text summarization has an important role in natural language processing. One type of text summarization is extractive summarization. Research on text summarization in the Indonesian language is still rare and has not been evaluated comprehensively. Each study has been conducted based only on the subjectivity of its researcher. This paper reviews and evaluates several works on Indonesian text summarization to identify the better method by analysing several aspects. This review also maps Indonesian text summarization evaluation techniques and identifies their advantages and drawbacks. This research aims to provide a comprehensive review of text summarization in the Indonesian language. The result of this study is a comparative review of several works, showing detailed aspects of each summarization method.
- Research Article
- 10.46799/jsa.v5i9.1483
- Sep 14, 2024
- Jurnal Syntax Admiration
The aim of this study is to evaluate how effective the LexRank algorithm and latent semantic analysis (LSA) are in automatic text summarization for the Indonesian language. This research focuses on natural language processing and the handling of excessive data. We applied both algorithms to generate text summaries using the IndoSum dataset, which contains about 20,000 news articles in Indonesian with manual summaries. To assess performance, the ROUGE metric was used, reporting precision, recall, and F1 score. In all tested metrics, LSA outperformed LexRank. LSA had a precision of 0.57, recall of 0.67, and an F1 score of 0.59, whereas LexRank had a precision of 0.46, recall of 0.52, and an F1 score of 0.48. These results indicate that LSA is better at gathering important information from the original text than LexRank.
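The ROUGE precision, recall, and F1 figures quoted above can be illustrated with a minimal ROUGE-1 sketch. This is not the official ROUGE toolkit and not necessarily the setup used in the study: it assumes lowercased whitespace tokenization with no stemming or stop-word removal, and the function name `rouge_1` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap ROUGE-1 scores for one candidate/reference pair.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each unigram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand_counts & ref_counts).values())

    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical Indonesian example pair, for illustration only.
    scores = rouge_1(
        "presiden meresmikan jalan tol baru",
        "presiden meresmikan jalan tol baru di jawa",
    )
    print(scores)
```

Corpus-level figures are typically averages of per-document scores, so a reported mean F1 need not equal the harmonic mean of the reported mean precision and mean recall.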
- Research Article
- 10.59188/eduvest.v5i2.1663
- Feb 20, 2025
- Eduvest - Journal of Universal Studies
The aim of this study is to evaluate how effective the LexRank algorithm and latent semantic analysis (LSA) are in automatic text summarization for the Indonesian language. This research focuses on natural language processing and the handling of excessive data. We applied both algorithms to generate text summaries using the IndoSum dataset, which contains about 20,000 news articles in Indonesian with manual summaries. To assess performance, the ROUGE metric was used, reporting precision, recall, and F1 score. In all tested metrics, LSA outperformed LexRank. LSA had a precision of 0.57, recall of 0.67, and an F1 score of 0.59, whereas LexRank had a precision of 0.46, recall of 0.52, and an F1 score of 0.48. These results indicate that LSA is better at gathering important information from the original text than LexRank.
- Research Article
- 10.23917/khif.v9i2.21495
- Oct 29, 2023
- Khazanah Informatika : Jurnal Ilmu Komputer dan Informatika
A major challenge in Indonesian automatic text summarization research is producing readable summaries. A summary is of good quality only if the meaning of the text is properly maintained. The purpose of this study is therefore to improve the quality of extractive Indonesian automatic text summarization by taking into account the quality of the structured text representation. This study employs Sequential Pattern Mining (SPM) to generate a sequence of words as a structured representation of text and a graph-based approach to generate the summary. The SPM algorithm used is PrefixSpan, and the graph-based approach uses the Bellman-Ford algorithm. The results of an experiment using the IndoSum dataset show that combining SPM and Bellman-Ford can improve the precision, recall, and F-measure of ROUGE-1, ROUGE-2, and ROUGE-L. When Bellman-Ford is combined with SPM, the F-measure of ROUGE-1 increases from 0.2299 to 0.3342, the ROUGE-2 F-measure increases from 0.1342 to 0.2191, and the ROUGE-L F-measure increases from 0.1904 to 0.2878. This result demonstrates that SPM can improve the performance of the Bellman-Ford algorithm in producing Indonesian text summaries.
- Book Chapter
- 10.1007/978-981-19-4863-3_17
- Oct 28, 2022
In this paper, we propose a hybrid model of latent semantic analysis with graph-based extractive text summarization for Telugu text. Latent semantic analysis (LSA) is an unsupervised method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a corpus of text. The TextRank algorithm is a graph-based ranking algorithm that is based on the similarity scores of the sentences. This hybrid method has been implemented on Eenadu Telugu e-news data. ROUGE-1 measures are used to evaluate the summaries of the proposed model against human-generated summaries. The proposed LSA with TextRank method has an F1-score of 0.97, against an F1-score of 0.50 for LSA and 0.49 for TextRank. The hybrid model yields better performance than the individual algorithms of latent semantic analysis and TextRank. Keywords: Text summarization; Latent semantic analysis; Text rank algorithm; Singular value decomposition; Telugu language
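The idea of ranking sentences on a similarity graph can be sketched as follows. This is a minimal illustration and not the hybrid method of the chapter above: it swaps in bag-of-words cosine similarity for the edge weights (the original TextRank paper uses a normalized word-overlap measure), and the function names, damping factor, iteration count, and example sentences are all assumptions.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Score sentences with weighted PageRank on a similarity graph.

    Assumptions: whitespace tokenization, cosine edge weights,
    a fixed number of power-iteration steps.
    """
    n = len(sentences)
    if n == 0:
        return []
    bags = [Counter(s.lower().split()) for s in sentences]
    # Symmetric similarity matrix with no self-loops.
    sim = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out_weight = [sum(row) for row in sim]

    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to
            # the weight of the edge j -> i.
            incoming = sum(sim[j][i] / out_weight[j] * scores[j]
                           for j in range(n) if out_weight[j] > 0)
            updated.append((1 - damping) / n + damping * incoming)
        scores = updated
    return scores


if __name__ == "__main__":
    # Hypothetical sentences; the last shares no words with the others.
    docs = [
        "kucing makan ikan",
        "kucing suka ikan",
        "ikan dimakan kucing",
        "hujan turun deras",
    ]
    for sentence, score in zip(docs, textrank_scores(docs)):
        print(f"{score:.4f}  {sentence}")
```

An extractive summary would then keep the highest-scoring sentences in their original order. A sentence that shares no vocabulary with the rest receives only the baseline `(1 - damping) / n` score.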
- Book Chapter
- 10.1007/978-981-16-5157-1_60
- Oct 26, 2021
Choosing relevant information from the vast amount of data available online is a difficult and challenging task. Automatic summarization can address this challenge. Summarization is the task of condensing a chunk of text into a shorter version, which reduces the size of the initial text while preserving the meaning of its content. This model performs automatic text summarization based on reinforcement learning and uses a deep learning network to estimate the Q value. We use ROUGE, a standard metric for evaluating automatic text summarization, to analyze the performance of our model. The three phases of the project are text processing, text formation, and text evaluation. In text processing, a set of sentences is selected using latent semantic analysis, and a summary is formed using reinforcement learning and a deep Q network. The summary is then evaluated using ROUGE. Keywords: Reinforcement learning; Deep learning network; Latent semantic analysis; Rouge
- Research Article
- 10.1109/access.2023.3238570
- Jan 1, 2023
- IEEE Access
The production of agricultural products and the high yield in these products are of critical importance for the continuation of human life. In recent years, machine learning and deep learning technologies have been widely used in determining agricultural productivity. The purpose of this study was to estimate the yield of apple fruit by using a novel deep learning-based hybrid method. First, by using images belonging to the golden and royal gala apple varieties, a classification was made with the help of a convolutional neural network (CNN) that was designed for the study. Then, using classical machine learning algorithms and bagging and boosting algorithms, a hybrid application was performed by classifying the images whose feature extractions were done with the designed CNN. The results of the study, presented on 4 separate datasets (Datasets A, B, C, and D), were evaluated based on accuracy, precision, recall, F-measure, and Cohen kappa scores. Considering the accuracy results for Datasets B, C, and D, it was determined that the hybrid model that gave the best result was the CNN-SVM model. For Dataset A, the CNN-SVM and CNN-Gradient Boosting hybrid models gave the best and same accuracy. Dataset C was determined as the most appropriate dataset in terms of the more balanced distribution of train, test, and validation size in the datasets, the results of the proposed hybrid CNN model, and the evaluation of the results of the model. For Dataset C, it was found that the accuracy of the hybrid model was 99.70%. Precision, recall, f-measure, and Cohen kappa scores were 99%. The results of the study revealed that the hybrid models showed effective results in determining the productivity of apple fruit through images belonging to the golden and royal gala varieties.
- Conference Article
- 10.1109/eiconcit50028.2021.9431880
- Apr 9, 2021
Automatic text summarization systems are increasingly needed to cope with the information explosion caused by internet growth. Since Indonesian is still considered an under-resourced language, we take advantage of pre-trained language models to perform abstractive summarization. This paper investigates BERT performance on Indonesian articles by comparing several pre-trained BERT models and evaluating the results based on ROUGE values. Our experiment shows that an English pre-trained model can produce a good summary given Indonesian text, but that using an Indonesian pre-trained model is more effective. Default training with only the abstractive objective is better than two-stage fine-tuning, where the extractive model must be trained in advance. We also found many meaningless words in the generated summaries. These findings are the result of a preliminary study to improve the Indonesian abstractive summarization model.
- Conference Article
- 10.1109/icscds56580.2023.10105055
- Mar 23, 2023
Automatic Text Summarization (ATS) is the process of extracting a few text portions from a given document. The extracted text can also contain sentences that express the overall meaning of the document. Most research in this field has been carried out, with high accuracy, for languages such as English and Chinese. In the proposed work, a hybrid model is used that combines a keyword-based score, a sentiment score, and a text-ranking-based score to perform automatic text summarization for the Tamil language. To evaluate the performance of the proposed model, a dataset built from scratch from two reputed Tamil newspapers is used. The dataset is categorized into four classes: Sports, Cinema, Astrology, and Politics. The hybrid model increases the accuracy of Tamil text summarization compared with previous research. The proposed model achieves an average recall of around 0.81, a precision of 0.61, and an F score of 0.67.
- Book Chapter
- 10.1007/978-981-19-2821-5_37
- Sep 27, 2022
Automatic text summarization involves extracting relevant details from the contents of input text documents for generating summaries. This area of Natural Language Processing is widely researched, especially with popular languages like English. There is a need to extend this work to less commonly spoken languages of the world. This paper presents a language-independent text summarization approach using Latent Semantic Analysis in Konkani language. Konkani is a low-resource language with limited language processing tools, stop-word list, etc. Latent Semantic Analysis (LSA) is an unsupervised algebraic method that finds latent semantic structures to be used for performing extractive text summarization. We examined well-known Latent Semantic Analysis-based sentence selection approaches on our dataset, constructed using books on Konkani folk tales written in Devanagari script. The results of the experiments indicated that LSA-based approaches can produce promising summaries, with the Cross method performing the best in most metrics. Keywords: Automatic text summarization; Latent semantic analysis; Konkani; Low-resource; Singular value decomposition; Extractive text summarization
- Conference Article
- 10.1109/isai-nlp.2018.8692976
- Nov 1, 2018
Due to the increasing availability of online information, tools and mechanisms for automatic summarization of documents are needed. Text summarization is currently a major research topic in Natural Language Processing. There are various approaches to generating a text summary. Among them, we propose Myanmar text summarization using latent semantic analysis (LSA). LSA is a technique in natural language processing that analyzes relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. It is an unsupervised approach that does not need any training or external knowledge. There has been no LSA-based sentence extraction for the Myanmar language; this is the first LSA-based text summarizer for Myanmar. This paper presents generic, extractive, single-document Myanmar text summarization using latent semantic analysis. It compares two LSA sentence selection methods (Steinberger and Jezek's approach and the Ozay approach) for extracting important sentences. We summarize Myanmar news from Myanmar official websites such as 7day daily, iyarwaddy, etc.
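LSA-based sentence selection of the kind described above can be sketched with a singular value decomposition of a term-by-sentence matrix. The sketch below follows one common reading of the Steinberger and Jezek score, where each sentence is scored by the length of its vector in the reduced topic space weighted by the singular values; it is not the paper's implementation. It assumes raw term counts, whitespace tokenization, and NumPy being available, and the function name, parameters, and English example sentences are hypothetical.

```python
import numpy as np


def lsa_select(sentences, num_sentences=2, num_topics=2):
    """Return indices of the selected sentences, in document order.

    Assumptions: whitespace tokens, raw term counts as weights,
    score_k = sqrt(sum_i (v[k, i] * sigma[i]) ** 2) over the top
    `num_topics` singular dimensions.
    """
    if not sentences:
        return []
    tokens = [s.lower().split() for s in sentences]
    vocab = sorted({w for sent in tokens for w in sent})
    if not vocab:
        return []
    row = {w: i for i, w in enumerate(vocab)}

    # Term-by-sentence count matrix A (terms x sentences).
    A = np.zeros((len(vocab), len(sentences)))
    for j, sent in enumerate(tokens):
        for w in sent:
            A[row[w], j] += 1.0

    # A = U * diag(sigma) * Vt; column k of Vt describes sentence k
    # in the latent topic space.
    _, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    r = min(num_topics, len(sigma))

    weighted = (Vt[:r, :] ** 2) * (sigma[:r, None] ** 2)
    scores = np.sqrt(weighted.sum(axis=0))

    top = np.argsort(-scores)[:num_sentences]
    return sorted(int(i) for i in top)


if __name__ == "__main__":
    # Hypothetical example sentences, for illustration only.
    docs = [
        "the flood hit the city on monday",
        "rescue teams reached the city after the flood",
        "the weather was sunny on tuesday",
        "officials said the flood damaged roads in the city",
    ]
    for i in lsa_select(docs, num_sentences=2):
        print(docs[i])
```

Other selection schemes for the same decomposition differ mainly in how they read `Vt`, for example by picking the best sentence per topic rather than a single combined score.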
- Research Article
- 10.1504/ijisc.2020.10029282
- Jan 1, 2020
- International Journal of Intelligence and Sustainable Computing
The increasing availability of information on the web and its ease of access necessitate efficient and effective automatic text summarisation. Automatic text summarisation condenses the source text (a single document or multiple documents) into a compact version preserving its overall meaning and information content. To date, researchers have employed different approaches for creating well-formed summaries. One of the most popular methods is latent semantic analysis (LSA). In this paper, various prominent works to produce extractive and abstractive text summaries based on different variants of the LSA algorithm are analysed.
- Research Article
- 10.3390/computers13100268
- Oct 12, 2024
- Computers
The majority of applications use automatic image recognition technologies to carry out a range of tasks. Therefore, it is crucial to identify and classify image distortions to improve image quality. Despite efforts in this area, there are still many challenges in accurately and reliably classifying distorted images. In this paper, we offer a comprehensive analysis of models of both non-lightweight and lightweight deep convolutional neural networks (CNNs) for the classification of distorted images. Subsequently, an effective method is proposed to enhance the overall performance of distortion image classification. This method involves selecting features from the pretrained models’ capabilities and using a strong classifier. The experiments utilized the kadid10k dataset to assess the effectiveness of the results. The K-nearest neighbor (KNN) classifier showed better performance than the naïve classifier in terms of accuracy, precision, error rate, recall and F1 score. Additionally, SqueezeNet outperformed other deep CNN models, both lightweight and non-lightweight, across every evaluation metric. The experimental results demonstrate that combining SqueezeNet with KNN can effectively and accurately classify distorted images into the correct categories. The proposed SqueezeNet-KNN method achieved an accuracy rate of 89%. As detailed in the results section, the proposed method outperforms state-of-the-art methods in accuracy, precision, error, recall, and F1 score measures.
- Conference Article
- 10.1109/icais50930.2021.9395976
- Mar 25, 2021
Document summarization is a natural language processing task that condenses long textual data into concise, fluent summaries containing all of the document's relevant information. The branch of NLP that deals with this is automatic text summarization, which converts a long textual document into a short, fluent summary. There are generally two approaches: extractive and abstractive summarization. This paper presents an experiment with extractive text summarization. Topic modelling, in turn, is an NLP task that extracts the relevant topics from a textual document. One such method is latent semantic analysis (LSA) using truncated SVD, which extracts the relevant topics from the text. In the experiment, a long textual document is summarized using LSA topic modelling together with a TF-IDF keyword extractor for each sentence, and a BERT encoder model is used to encode the sentences in order to retrieve the positional embeddings of the topic word vectors. The proposed algorithm achieves a higher score than text summarization using Latent Dirichlet Allocation (LDA) topic modelling.
- Conference Article
- 10.1109/iac.2018.8780512
- Oct 1, 2018
Automatic text summarization is the process of shortening a text document. Researchers have explored various approaches to producing shorter text without losing the significance of the content. This is also taking place in Indonesia, where numerous researchers are exploring how best to implement appropriate methods for summarizing text automatically. This paper attempts to provide an overview of recent research on automatic text summarization, especially in Bahasa Indonesia. It focuses on single-document summarization, research methodologies, and discussion of evaluation results. In summary, there is room for improvement in automatic text summarization in Bahasa Indonesia because the evaluation results of the research are still average.