Representation of structured data of the text genre as a technique for automatic text processing

Claudia Aparecida Fonseca, Marcus Vinícius Carvalho Guelpeli, Rafael Santiago De Souza Netto

doi:10.35699/1983-3652.2022.35445


Open Access

https://doi.org/10.35699/1983-3652.2022.35445


Abstract

This article is situated at the intersection of Natural Language Processing (NLP) and Language Studies and is based on a corpus compiled with computational tools. The study assumes that it is helpful to establish a close relationship between corpus generation/annotation and the assessment of the constitutive elements of the source text genre. It aims to demonstrate, through specific studies of structured data from the text genre ‘scientific article’, alternatives to automatic text processing techniques. To reach this goal, the authors created a computational model for compiling a specialized linguistic corpus representative of the scientific article genre, CorpACE. The object of study comprises the constitutive elements of scientific articles, marked up in XML, extracted and collected from the SciELO (Scientific Electronic Library Online) database. The final product is a database of information extracted and structured in XML format, which designates and identifies the markups of the genre being analyzed and is available for many tools and applications. The results demonstrate how the representation of the constitutive elements of the genre can condense available information through hierarchical and dynamic processes built during the compilation. The study concludes that more research is needed to bring Language Science and Computer Science closer together, with emphasis on NLP, in order to represent and manipulate linguistic knowledge at its many levels – morphological, syntactic, semantic and discursive – and thereby improve the implementation of automatic text processing.

Highlights

After the computer and media revolution, electronic documents have become one of the most read scholarly and informational media for much of the world’s population due to widespread internet use
The compilation of these files generated CorpACE, which is characterized as a specialized corpus, since it is composed of texts from a single area of expertise – education – and is representative of a single text genre – the scientific article
CorpACE underwent computational treatment following Corpus Linguistics (CL) methodology, applied in this research through the AnoTex tool, which generated the data in Table 3

Summary

Introduction

After the computer and media revolution, electronic documents have become one of the most read scholarly and informational media for much of the world’s population due to widespread internet use. This study is of interest both from a practical point of view, as it helps users employ computer tools to access, retrieve and use increasing amounts of online information, and from a scientific-theoretical point of view, since it requires deep understanding of natural language by machines. This knowledge is associated with processes such as reading, comprehension, presentation, evaluation and production of texts. Besides semantic and morpho-syntactic aspects, verbal language involves discourse processes that can transmit information about, for instance, time and space. All of this creates a challenge for fields of knowledge such as Language Science and Computer Science, with an emphasis on NLP: to create standardized representations for metadata, normalized in a format that the computer can understand. This type of knowledge can be used by researchers to develop new pedagogical strategies for the use and application of structured language resources.
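To make the idea of a machine-readable representation of genre elements concrete, the following is a minimal sketch in Python of how the constitutive elements of a scientific article might be marked up in XML and read back into a structured record. The tag names (`article`, `title`, `abstract`, `keywords`, `section`), the attributes, and the `extract_elements` helper are illustrative assumptions; they are not the CorpACE schema, the SciELO markup, or the AnoTex implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: tag and attribute names are illustrative only,
# not the schema used by CorpACE or SciELO.
SAMPLE = """<article id="a1" area="education">
  <title>Sample title</title>
  <abstract>Short abstract text.</abstract>
  <keywords>
    <keyword>corpus</keyword>
    <keyword>genre</keyword>
  </keywords>
  <section name="Introduction">First section text.</section>
  <section name="Conclusion">Final section text.</section>
</article>"""


def extract_elements(xml_text):
    """Parse one marked-up article and return its elements as a dict."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "area": root.get("area"),
        "title": root.findtext("title"),
        "abstract": root.findtext("abstract"),
        # Keywords are nested one level down, under <keywords>.
        "keywords": [k.text for k in root.findall("keywords/keyword")],
        # Map each section name to its text, in document order.
        "sections": {s.get("name"): s.text for s in root.findall("section")},
    }


record = extract_elements(SAMPLE)
print(record["title"])
print(record["keywords"])
print(list(record["sections"]))
```

Once each article is reduced to a record of this kind, downstream tools can query a single element (for example, only abstracts or only conclusions) without reprocessing the full text, which is one way the hierarchical structure can condense the available information.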

Natural Language Processing

Main NLP techniques

Correlated studies

Computational model – AnoTex

Results and discussion

Conclusion