Abstract

Finding related nucleotide or protein sequences is a fundamental, diverse, and incompletely-solved problem in bioinformatics. It is often tackled by seed-and-extend methods, which first find "seed" matches of diverse types, such as spaced seeds, subset seeds, or minimizers. Seeds are usually found using an index of the reference sequence(s), which stores seed positions in a suffix array or related data structure. A child table is a fundamental way to achieve fast lookup in an index, but previous descriptions have been overly complex. This paper aims to provide a more accessible description of child tables, and demonstrate their generality: they apply equally to all the above-mentioned seed types and more. We also show that child tables can be used without LCP (longest common prefix) tables, reducing the memory requirement.
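
As a rough illustration of the index lookup the abstract refers to, the following minimal sketch builds a suffix array naively and finds a contiguous seed by plain binary search. It is not the paper's child-table method: it shows the baseline lookup that a child table is meant to speed up. The function names `build_suffix_array` and `find_seed`, and the example sequence, are hypothetical and chosen only for this sketch.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive construction for illustration only; real indexes use faster methods.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_seed(text, suffix_array, seed):
    """Return the half-open range [beg, end) of suffix_array entries
    whose suffixes start with seed, using two binary searches."""
    k = len(seed)

    def prefix(rank):
        # First k letters of the suffix at this rank (shorter near the end).
        start = suffix_array[rank]
        return text[start:start + k]

    # Lower bound: first rank whose prefix is >= seed.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < seed:
            lo = mid + 1
        else:
            hi = mid
    beg = lo

    # Upper bound: first rank whose prefix is > seed.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= seed:
            lo = mid + 1
        else:
            hi = mid
    return beg, lo


text = "GATTACAGATTA"          # toy reference sequence (assumption)
sa = build_suffix_array(text)
beg, end = find_seed(text, sa, "ATTA")
positions = sorted(sa[beg:end])
print(positions)               # [1, 8]
```

Each binary search step compares a whole seed-length prefix, so a lookup costs on the order of the seed length times the logarithm of the reference length. This sketch handles only contiguous seeds; spaced or subset seeds would need a different comparison of suffixes.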

Highlights

  • Sequence similarity search remains a fundamental and incompletely-solved task in bioinformatics

  • We start by getting the topmost element of the child table, in this case 6, which points to the location indicated by –6– in the suffix array

  • The recursive definition of a child table just described is performed by the algorithm in Fig. 5, which should be invoked for the outermost interval like this: makeChildTable(0, suffixArray.length, 0)

Summary

A Simplified Description of Child Tables for Sequence Similarity Search

INTRODUCTION
THE SEED-AND-EXTEND APPROACH
Spaced Seeds
Subset Seeds
Variable-Length Seeds
Sparse Seeds
Seed Summary
ARRAY AND RANGE CONVENTIONS
INDEXING
SUFFIX ARRAYS
Binary Search in Suffix Arrays
Suffix Array Generalizations
CHILD TABLES
LCP Arrays
Child Table Definition
Relationship to Tree Data Structures
Child Table Search
Search Algorithm Variants
REMARKS ON CONSTRUCTION
ALTERNATIVES
A Compact Child Table
Multiple Seed Patterns
Subset Seeding for Bisulfite-Converted DNA
CPU Cache Misses
Cache-Friendly Layouts
CONCLUSION