Abstract

Grouping documents into categories is one way to streamline information retrieval, a task made increasingly relevant by the large volume of information available today. Manually locating documents on a specific theme in a digital repository involves reading the title, abstract, and keywords, followed by a more detailed evaluation to determine whether a publication belongs to the desired thematic axis. Given the number of publications in a digital repository, manually locating all the texts on a given topic can be laborious and time-consuming. This research proposes an architecture for the automatic classification of texts based on syntactic features, namely the comparison of n-grams: contiguous sequences of n words identified throughout the text. An applied exploratory study was carried out using supervised learning, fundamentally based on the bag-of-words (BoW) document representation model. The paper's macro objective was to classify texts in general into pre-defined categories, using the generation and comparison of degrees of belonging between texts as one of the key criteria. The results of these comparisons, using n = 3, show that in n-gram classification, the larger the value of n and with stop words removed, the lower the degree of belonging, indicating greater rigor in identifying matches during classification. For greater confidence in the results, a larger training corpus is needed to expand the set of words that characterize the pre-defined categories used in classifying the texts.
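The abstract does not specify how the degree of belonging is computed; the sketch below is one plausible reading, assuming a simple set-overlap score between a document's n-grams and those of each category's training corpus. All function names, the stop-word handling, and the overlap metric are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of n-gram classification with a "degree of belonging"
# score, assumed here to be the fraction of the document's n-grams that also
# occur in a category's corpus. The paper's exact metric is not given.

def ngrams(text, n=3, stop_words=frozenset()):
    """Return the set of word n-grams in `text`, skipping stop words."""
    tokens = [w for w in text.lower().split() if w not in stop_words]
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def degree_of_belonging(doc, category_corpus, n=3, stop_words=frozenset()):
    """Fraction of the document's n-grams found in the category corpus."""
    doc_grams = ngrams(doc, n, stop_words)
    cat_grams = ngrams(category_corpus, n, stop_words)
    if not doc_grams:
        return 0.0
    return len(doc_grams & cat_grams) / len(doc_grams)

def classify(doc, categories, n=3, stop_words=frozenset()):
    """Pick the category with the highest degree of belonging.

    `categories` maps a category name to its training corpus (one string).
    Returns the winning category name and the full score dictionary.
    """
    scores = {name: degree_of_belonging(doc, corpus, n, stop_words)
              for name, corpus in categories.items()}
    return max(scores, key=scores.get), scores
```

Note how the abstract's reported effect falls out of this formulation: longer n-grams and stop-word removal both shrink the pool of candidate matches, so the overlap (degree of belonging) drops and only stricter matches survive.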

