A Simplified Description of Child Tables for Sequence Similarity Search.

Martin C Frith, Anish M S Shrestha

doi:10.1109/tcbb.2018.2796064

Abstract

Finding related nucleotide or protein sequences is a fundamental, diverse, and incompletely-solved problem in bioinformatics. It is often tackled by seed-and-extend methods, which first find "seed" matches of diverse types, such as spaced seeds, subset seeds, or minimizers. Seeds are usually found using an index of the reference sequence(s), which stores seed positions in a suffix array or related data structure. A child table is a fundamental way to achieve fast lookup in an index, but previous descriptions have been overly complex. This paper aims to provide a more accessible description of child tables, and demonstrate their generality: they apply equally to all the above-mentioned seed types and more. We also show that child tables can be used without LCP (longest common prefix) tables, reducing the memory requirement.

Highlights

Sequence similarity search remains a fundamental and incompletely-solved task in bioinformatics
We start by getting the topmost element of the child table, in this case 6, which points to the location indicated by –6– in the suffix array
The recursive definition of a child table just described is performed by the algorithm in Fig. 5, which should be invoked for the outermost interval like this: makeChildTable(0, suffixArray.length, 0)
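The recursive idea in the last highlight can be illustrated with a minimal sketch. This is not the algorithm of the paper's Fig. 5, which is not reproduced on this page. It is a simplified stand-in written under these assumptions: the suffix array is built naively, the table is a Python dict keyed by interval rather than the compact array the paper describes, and all function names (`build_suffix_array`, `symbol_at`, `make_child_table`, `find_matches`) are hypothetical. The outermost call mirrors the invocation quoted above: interval start 0, interval end equal to the suffix array length, depth 0.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def symbol_at(text, suffix_array, k, depth):
    """Symbol at offset `depth` in the k-th sorted suffix ('' past the end)."""
    pos = suffix_array[k] + depth
    return text[pos] if pos < len(text) else ""


def make_child_table(text, suffix_array, beg, end, depth, table):
    """Split interval [beg, end) where the symbol at some offset changes.

    Assumption: all suffixes in [beg, end) agree on their first `depth`
    symbols. The offset is advanced until the suffixes first disagree,
    the cut points are stored, and each sub-interval is processed in turn.
    """
    if end - beg < 2:
        return  # a single suffix has no children
    while True:
        cuts = [k for k in range(beg + 1, end)
                if symbol_at(text, suffix_array, k, depth)
                != symbol_at(text, suffix_array, k - 1, depth)]
        if cuts:
            break
        depth += 1  # distinct suffixes must disagree at some offset
    table[(beg, end)] = (depth, cuts)
    edges = [beg] + cuts + [end]
    for lo, hi in zip(edges, edges[1:]):
        make_child_table(text, suffix_array, lo, hi, depth + 1, table)


def find_matches(text, suffix_array, table, query):
    """Descend through child intervals, then verify the surviving suffixes."""
    beg, end = 0, len(suffix_array)
    while end - beg > 1:
        branch_depth, cuts = table[(beg, end)]
        if branch_depth >= len(query):
            break  # the query ends before this interval branches
        edges = [beg] + cuts + [end]
        for lo, hi in zip(edges, edges[1:]):
            if symbol_at(text, suffix_array, lo, branch_depth) == query[branch_depth]:
                beg, end = lo, hi
                break
        else:
            return []  # no child starts with the required symbol
    # Offsets skipped during the descent were never compared, so verify.
    hits = (suffix_array[k] for k in range(beg, end))
    return sorted(p for p in hits if text.startswith(query, p))


text = "banana"
suffix_array = build_suffix_array(text)
table = {}
make_child_table(text, suffix_array, 0, len(suffix_array), 0, table)
print(suffix_array)                                    # [5, 3, 1, 0, 4, 2]
print(find_matches(text, suffix_array, table, "ana"))  # [1, 3]
```

The dict keeps the sketch easy to read but costs far more memory than a flat array. It is meant only to show how a lookup can jump between sub-intervals instead of binary-searching the whole suffix array.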

Summary

INTRODUCTION

THE SEED-AND-EXTEND APPROACH

Spaced Seeds

Subset Seeds

Variable-Length Seeds

Sparse Seeds

Seed Summary

ARRAY AND RANGE CONVENTIONS

INDEXING

SUFFIX ARRAYS

Binary Search in Suffix Arrays

Suffix Array Generalizations

CHILD TABLES

LCP Arrays

Child Table Definition

Relationship to Tree Data Structures

Child Table Search

Search Algorithm Variants

REMARKS ON CONSTRUCTION

ALTERNATIVES

A Compact Child Table

Multiple Seed Patterns

Subset Seeding for Bisulfite-Converted DNA

CPU Cache Misses

Cache-Friendly Layouts

CONCLUSION

Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics
Publication Date: Feb 9, 2018
Citations: 43
License type: CC BY 3.0
