Indexing labeled sequences.

Tatiana Rocher,Mathieu Giraud,Mikaël Salson

doi:10.7717/peerj-cs.148


Open Access


https://doi.org/10.7717/peerj-cs.148



Abstract


Background: Labels are a way to add information to a text, for example functional annotations such as genes on a DNA sequence. V(D)J recombinations are DNA recombinations involving two or three short genes in lymphocytes. Sequencing this short region (500 bp or less) produces labeled sequences and gives insight into the lymphocyte repertoire for onco-hematology or immunology studies.

Methods: We present two indexes for a text with non-overlapping labels. They store the text in a Burrows–Wheeler transform (BWT) and a compressed label sequence in a Wavelet Tree. The label sequence is taken in the order of the text (TL-index) or in the order of the BWT (TLBW-index). Both indexes use space related to the entropy of the labeled text.

Results: These indexes allow efficient text–label queries to count and find labeled patterns. The TLBW-index has an overhead on simple label queries but is very efficient on combined pattern–label queries. We implemented the indexes in C++ and compared them against a baseline solution on pseudo-random as well as on V(D)J labeled texts.

Discussion: New indexes such as the ones we propose improve the way we index and query labeled texts, for instance lymphocyte repertoires for hematological and immunological studies.

Highlights

Labels are a way to add information to a text, such as the semantics of words in an English sentence or functional annotations such as genes on a DNA sequence
We introduce two indexes which store a labeled text and answer position–label association queries. These indexes share some ideas with the RL-FMI (Mäkinen & Navarro, 2004), which uses a Burrows–Wheeler transform (BWT) and a Wavelet Tree (WT)
TL-index: indexing labels over a text. Given a labeled text (T, A), the TL-index consists of an FM-index built on a BWT U to index the text, a bit vector B_A marking the positions in the text where the labels change, and a WT W_A indexing a compressed label sequence (Fig. 2A)
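
To make the TL-index layout more concrete, here is a minimal Python sketch of its label side only: a bit vector marking where the label changes along the text, plus the compressed label sequence holding one entry per run. It rests on several assumptions that are not from the paper: a plain Python list stands in for the Wavelet Tree, a prefix-sum array stands in for a succinct rank structure, unlabeled positions are encoded as `None`, and the function names are hypothetical. The FM-index on the text is omitted entirely.

```python
from itertools import accumulate


def build_label_runs(labels):
    """Compress one-label-per-position input into (bits, runs).

    bits[i] is 1 when position i starts a new label run (position 0 always
    does), and runs holds the label of each run in text order.
    """
    bits, runs = [], []
    prev = object()  # sentinel that differs from every label, including None
    for lab in labels:
        if lab != prev:
            bits.append(1)
            runs.append(lab)
        else:
            bits.append(0)
        prev = lab
    return bits, runs


def build_rank(bits):
    """prefix[i] = number of 1 bits in bits[0..i] (inclusive)."""
    return list(accumulate(bits))


def label_at(prefix, runs, i):
    """Label of text position i: the run whose start is the last 1 at or before i."""
    return runs[prefix[i] - 1]


# Toy labeled text: three positions labeled "V", two unlabeled, two labeled "J".
labels = ["V", "V", "V", None, None, "J", "J"]
bits, runs = build_label_runs(labels)
prefix = build_rank(bits)
```

In this toy layout a position-to-label query costs one rank on the bit vector and one access in the compressed sequence. A real implementation would replace the list by a Wavelet Tree so that label-based counting and selection queries also become efficient.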

Summary

Introduction

Labels are a way to add information to a text, such as the semantics of words in an English sentence or functional annotations such as genes on a DNA sequence. We introduce two indexes which store a labeled text and answer position–label association queries. These indexes share some ideas with the RL-FMI (Mäkinen & Navarro, 2004), which uses a Burrows–Wheeler transform (BWT) and a Wavelet Tree (WT). The label sequence is taken in the order of the text (TL-index) or in the order of the BWT (TLBW-index). Discussion: New indexes such as the ones we propose improve the way we index and query labeled texts, for instance lymphocyte repertoires for hematological and immunological studies
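
The difference between taking the label sequence in text order and in BWT order can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the BWT is built naively by sorting rotations, a `$` terminator smaller than every text character is appended, the terminator carries no label (`None`), and each BWT position is assumed to carry the label of the text character stored at that position. The function name is hypothetical, and the paper's exact convention may differ.

```python
def bwt_with_labels(text, labels):
    """Return (bwt, bwt_labels) for a text with one label per character.

    Naive construction by sorting all rotations of text + "$"; suitable
    only for short examples.
    """
    t = text + "$"
    labs = list(labels) + [None]  # the terminator is unlabeled
    n = len(t)
    # Sorting rotations equals sorting suffixes because "$" is unique and smallest.
    order = sorted(range(n), key=lambda i: t[i:] + t[:i])
    # The BWT character of a rotation starting at i is the character preceding it.
    bwt = "".join(t[(i - 1) % n] for i in order)
    bwt_labels = [labs[(i - 1) % n] for i in order]
    return bwt, bwt_labels


# Toy example: first half of the text labeled "x", second half labeled "y".
bwt, bwt_labels = bwt_with_labels("banana", ["x", "x", "x", "y", "y", "y"])
```

On this toy input the text-order label sequence has two runs, while the BWT-order sequence is permuted together with the characters. Storing labels in BWT order means the BWT interval found for a pattern can be checked against labels directly, which fits the reported efficiency of the TLBW-index on combined pattern–label queries.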

Methods

Results

Conclusion