Abstract

Sequences of symbols arise in diverse fields, including genome sequences, written texts and computer code. An interesting question is whether a sequence of symbols contains correlated structures. Existing methods to characterize correlations require a numerical representation of the sequence, so mapping a sequence of symbols into a sequence of numerical values is a key step in correlation analysis. This work proposes a two-step methodology to study correlations in a sequence of symbols. In the first step, the sequence of symbols is mapped into a multivariate numerical sequence formed by unit vectors in a vector space. The main feature of this mapping is that all symbols are equally weighted, which avoids the numerical overrepresentation of any symbol. In the second step, a multivariate version of detrended fluctuation analysis (DFA) is used to quantify correlations in the numerical sequence. Genome sequences (the first COVID-19 genome), written English texts and comovements between the Bitcoin and gold markets were used to illustrate the performance of the proposed methodology. The results showed that the balanced numerical mapping of symbolic sequences and the multivariate DFA provide valuable insights into the correlations in a sequence of symbols.
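
The two steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the unit vectors are constructed or how the multivariate DFA is defined, so the sketch assumes a one-hot mapping (each distinct symbol becomes a standard basis vector) and a simple multivariate DFA with linear detrending applied to each component. The function names `one_hot_map`, `multivariate_dfa` and `scaling_exponent` are hypothetical.

```python
import numpy as np


def one_hot_map(sequence):
    """Map each distinct symbol to a standard basis (unit) vector.

    Assumption: one-hot encoding is used as the equally weighted mapping.
    """
    alphabet = sorted(set(sequence))
    index = {sym: i for i, sym in enumerate(alphabet)}
    vectors = np.zeros((len(sequence), len(alphabet)))
    vectors[np.arange(len(sequence)), [index[s] for s in sequence]] = 1.0
    return vectors, alphabet


def multivariate_dfa(vectors, scales):
    """Return the fluctuation F(s) for each window size s.

    Assumption: linear detrending per component, with the squared
    residuals summed over components (squared Euclidean norm).
    """
    # Profile: cumulative sum of the mean-centered vector sequence.
    profile = np.cumsum(vectors - vectors.mean(axis=0), axis=0)
    n = profile.shape[0]
    fluct = []
    for s in scales:
        t = np.arange(s)
        window_ms = []
        for w in range(n // s):
            seg = profile[w * s:(w + 1) * s]
            # polyfit accepts a 2-D y and fits every column at once.
            coeffs = np.polyfit(t, seg, 1)
            trend = np.outer(t, coeffs[0]) + coeffs[1]
            window_ms.append(np.mean(np.sum((seg - trend) ** 2, axis=1)))
        fluct.append(np.sqrt(np.mean(window_ms)))
    return np.array(fluct)


def scaling_exponent(scales, fluct):
    """Slope of log F(s) versus log s."""
    slope, _ = np.polyfit(np.log(scales), np.log(fluct), 1)
    return slope


# Usage on a synthetic, uncorrelated four-letter sequence.
rng = np.random.default_rng(0)
symbols = "".join(rng.choice(list("ACGT"), size=4000))
vectors, alphabet = one_hot_map(symbols)
scales = np.array([16, 32, 64, 128, 256])
alpha = scaling_exponent(scales, multivariate_dfa(vectors, scales))
print(alphabet, round(alpha, 2))
```

For an uncorrelated sequence such as the synthetic one above, the exponent should be close to 0.5. Values well above 0.5 would suggest persistent correlations and values below 0.5 anti-persistent ones. One-hot vectors are only one way to weight symbols equally, and the original work may use a different construction.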
