Text classification is a mainly investigated challenge in Natural Language Processing (NLP) research. The higher performance of a classification model depends on a representation that can extract valuable information about the texts. Aiming not to lose crucial local text information, a way to represent texts is through flows, sequences of information collected from texts. This paper proposes an approach that combines various techniques to represent texts: the representation by flows, the benefit of the word embeddings text representation associated with lexicon information via semantic similarity distances, and the extraction of features inspired by well-established audio analysis features.In order to perform text classification, this approach splits the text into sentences and calculates a semantic similarity metric to a lexicon on an embedding vector space. The sequence of semantic similarity metrics composes the text flow. Then, the method performs the extraction of twenty-five features inspired by audio analysis (named Audio-Like Features). The features adaptation from audio analysis comes from a similitude between a text flow and a digital signal, in addition to the existing relationship between text, speech, and audio. We evaluated the method in three NLP classification tasks: Fake News Detection in English, Fake News Detection in Portuguese, and Newspaper Columns versus News Classification. The approach efficacy is compared to baselines that embed semantics in text representation: the Paragraph Vector and the BERT. The objective of the experiments was to investigate if the proposed approach could compete with the baselines methods improve their efficacy when associated with them. The experimental evaluation demonstrates that the association between the proposed and the baseline methods can enhance the baseline classification efficacy in all three scenarios. In the Fake News Detection in Portuguese task, our approach surpassed the baselines and obtained the best effectiveness (PR-AUC = 0.98).
Read full abstract