Abstract
Since the Human Genome Project was completed in 2003, many data scientists have developed algorithms to store and query large volumes of genomic data. The storage techniques most commonly used by these algorithms are flat files and relational databases. While sophisticated indexing techniques can accelerate queries, an alternative is to store biological sequence data directly in a form that supports efficient queries. Here we introduce a new algorithm that aims to compress redundant information and speed up queries by using graphical databases, which have been commercially available since the mid-to-late 2000s. A graphical database stores information as nodes and relationships (edges). Our approach is to identify subsequences that are common among many sequences and to store these as "common nodes" in the graphical database. For sequencing data, this is accomplished as follows. Each sequence is split into k-mers. If a given k-mer is common to enough sequences, it is labeled as a common segment; if a k-mer is unique (or common to too few sequences), it is labeled as a single segment. Common nodes and single nodes are then formed from common segments and single segments, respectively. These two kinds of nodes are connected by edges in the graphical database, so that each original sequence can be reconstructed by following edges in the graph. This graphical database model allows for fast taxonomic queries of 16S rDNA. When queried, the database can first attempt to find common nodes that match the query sequence, and then follow edges to single nodes to refine the search. This approach is analogous to that of "compressive genomics", except that the compression is implicit in the graphical database storage model. Beyond simple sequence queries, this graphical database representation also supports variability analysis, which distinguishes highly variable from conserved regions of the 16S sequence.
Regions of low variability correspond to common nodes, while regions of high variability correspond to a variety of paths through single nodes. The figure illustrates common and single nodes, together with a corresponding plot of variability. Benchmarking of sequence search indicates that query time in graphical databases is significantly faster than in flat files or relational databases. Using graphical databases in genomic data analysis will allow for accelerated search, and may lend itself to other forms of efficient analysis, such as tetramer frequency analysis, which is useful in metagenomic binning.
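The segmentation and labeling steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it uses non-overlapping k-mers, treats a k-mer as common when it occurs in at least `min_support` sequences regardless of position, and keeps the graph in plain dictionaries rather than in a graphical database. The names `split_kmers`, `build_graph`, `reconstruct`, and `min_support` are hypothetical, and the toy sequences are made up for the example.

```python
from collections import defaultdict


def split_kmers(seq, k):
    """Split seq into consecutive, non-overlapping k-mers (assumption).

    The final segment may be shorter than k.
    """
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def build_graph(sequences, k, min_support):
    """Label each k-mer as common or single and record one path per sequence.

    sequences: dict mapping a sequence id to its string.
    Returns (nodes, paths), where nodes maps a node key to its segment and
    paths maps a sequence id to an ordered list of node keys. Consecutive
    keys in a path play the role of edges in the graph.
    """
    # Count how many distinct sequences contain each k-mer.
    support = defaultdict(set)
    for seq_id, seq in sequences.items():
        for kmer in split_kmers(seq, k):
            support[kmer].add(seq_id)

    nodes = {}
    paths = {}
    for seq_id, seq in sequences.items():
        path = []
        for pos, kmer in enumerate(split_kmers(seq, k)):
            if len(support[kmer]) >= min_support:
                # Common node: stored once and shared by all sequences.
                key = ("common", kmer)
            else:
                # Single node: belongs to one position of one sequence.
                key = ("single", seq_id, pos)
            nodes[key] = kmer
            path.append(key)
        paths[seq_id] = path
    return nodes, paths


def reconstruct(nodes, path):
    """Rebuild an original sequence by following its path through the nodes."""
    return "".join(nodes[key] for key in path)


# Toy input (made up): three short sequences sharing the prefix "ACGT".
sequences = {"a": "ACGTTTGA", "b": "ACGTCCGA", "c": "ACGTTTGG"}
nodes, paths = build_graph(sequences, k=4, min_support=2)
```

In this toy input the shared prefix is stored once as a common node, and each sequence keeps one single node for its differing suffix, which is where the implicit compression comes from. Calling `reconstruct(nodes, paths["a"])` follows the stored path and returns the original string for sequence `a`.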