A STUDY OF SUMMARIZATION TECHNIQUES IN ALBANIAN LANGUAGE

Endri Xhina, Ilia Ninka, Roland Vasili, Thomas Souliotis

doi:10.35120/kij28072251r

Abstract

In recent years, digital technology has advanced rapidly and changed how we see the world. It has produced tools that give anyone immediate access to whatever information they need. This digital revolution across media such as computers and smartphones has generated a huge amount of digital data to be handled. Our research concerns one aspect of this data, text, and how to handle it efficiently and produce meaningful summaries. Text mining has only recently become an active research field, driven by the vast increase in the volume of text on the web. Because of its size, this text needs to be summarized so that the useful information can be obtained efficiently, without processing all of the original text, which is impractical in many cases. Text summarization systems are therefore among the most attractive research areas today. Text summarization is the process of identifying the main source of information, extracting its most important content, and presenting it as a concise text in a predefined template. The two main families of summarization techniques are extractive and abstractive; much research has been carried out on both, especially on extractive summarization. An extractive summary consists of sentences taken from the original text, whereas an abstractive summary is constructed anew, without reusing sentences from the original. Abstractive techniques are more complex for this reason, but they yield more meaningful summaries. This paper first takes a theoretical approach, describing the widely used summarization techniques.
Moreover, these techniques are then put into practice on the Albanian language only, since a language's structure, form, and rules are an important factor that may lead to different outcomes for each algorithm. This is the first attempt in the field of summarization in the Albanian language, and there is a strong need for future research in this area. This paper investigates various text summarization methods that are usually applied to English (and possibly other widely used languages), compares them, and concludes which method is suitable for summarizing documents in Albanian. We analyze various summarization algorithms and provide a formal way of verifying our results: we use different metrics (e.g. ROUGE) to evaluate the accuracy of the summaries produced by each technique against gold-standard summaries written by linguistic experts. Finally, we will make the full practical implementation of this work available, either by uploading it to a publicly accessible GitHub repository or by offering it as microservices through a web page.
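To illustrate the kind of evaluation mentioned above, the following is a minimal sketch of ROUGE-1 (unigram overlap) between a system summary and a gold-standard summary. It is not the implementation used in this work: the function names `tokenize` and `rouge_1`, the simple regular-expression tokenizer, and the sample sentences are assumptions made for illustration only, and no stemming or stop-word removal is applied.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, then take runs of word characters.
    # In Python 3, \w is Unicode-aware, so letters such as "ë" and "ç" are kept.
    return re.findall(r"\w+", text.lower())


def rouge_1(candidate, reference):
    """Return ROUGE-1 precision, recall and F1 for one candidate/reference pair."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))

    # Clipped unigram overlap: each word counts at most as often as it
    # appears in both texts (Counter intersection takes the minimum count).
    overlap = sum((cand_counts & ref_counts).values())

    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical sentences, not taken from the paper's data set.
    gold = "Tirana është kryeqyteti i Shqipërisë"
    system = "Tirana është kryeqytet"
    print(rouge_1(system, gold))
```

Recall measures how much of the expert summary is covered by the system summary, while precision penalizes summaries that add unrelated words; reporting both (or their F1) avoids rewarding overly long summaries.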

Full Text