Abstract

Searching genomic interval sets produced by sequencing methods is now widespread and routine; however, existing metrics for quantifying the similarity among interval sets are inconsistent. Here we introduce Seqpare, a self-consistent and effective similarity metric and a tool for comparing sequences based on their interval sets. With this metric, the similarity of two interval sets is quantified by a single index, the ratio of their effective overlap to their union: an index of zero indicates unrelated interval sets, and an index of one means that the interval sets are identical. Analysis and tests confirm the effectiveness and self-consistency of the Seqpare metric.
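
The overlap-over-union idea can be illustrated with a minimal sketch. The abstract does not define "effective overlap", so the code below substitutes plain base-pair overlap on a single chromosome. That substitution is an assumption, which makes this a Jaccard-style stand-in rather than the Seqpare computation itself. The function names and the half-open interval convention are also hypothetical choices made for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered(intervals: List[Interval]) -> int:
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge(intervals))


def overlap_union_index(a: List[Interval], b: List[Interval]) -> float:
    """Base-pair overlap divided by base-pair union.

    ASSUMPTION: base-pair overlap stands in for Seqpare's
    "effective overlap", which this text does not define.
    """
    union = covered(a + b)
    if union == 0:
        return 0.0  # two empty sets: treated as unrelated here
    overlap = covered(a) + covered(b) - union  # inclusion-exclusion
    return overlap / union


if __name__ == "__main__":
    print(overlap_union_index([(0, 10)], [(0, 10)]))   # identical sets
    print(overlap_union_index([(0, 10)], [(20, 30)]))  # disjoint sets
    print(overlap_union_index([(0, 10)], [(5, 15)]))   # partial overlap
```

Like the index described above, this stand-in is bounded by zero for sets that share nothing and one for identical sets, so a single number can rank search results.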

Highlights

  • Functional genomic data are often summarized as interval sets and deposited in public repositories (e.g., UCSC, ENCODE, Roadmap, GEO, SRA)

  • To compare a query interval set with multiple interval sets in a genomic sequence database, the search tools LOLA (Sheffield & Bock, 2016) and GIGGLE (Layer et al., 2018) calculate two values (Fisher’s exact p-value and the odds ratio based on the total number of intersections) and use them as the similarity score to rank the search results

  • The Fisher’s exact-based metrics require two values (p-value and odds ratio), but neither directly measures how similar two interval sets are: p-values are sensitive to the total number of regions and can fall as low as 10^-200 for large genomic interval sets, and odds ratios are sensitive to small numbers
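
The sensitivity described in the last highlight can be seen in a small sketch. It computes the sample odds ratio and a one-sided Fisher's exact p-value for hypothetical 2x2 tables of counts. The tables and function names are invented for illustration; how LOLA or GIGGLE actually build their contingency tables is not specified here, so this is not a reproduction of either tool.

```python
from math import comb


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio of the 2x2 table [[a, b], [c, d]].

    Raises ZeroDivisionError when b * c == 0; real tools need a
    convention for that case, which this sketch does not pick.
    """
    return (a * d) / (b * c)


def fisher_right_tail(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p-value, P(X >= a), with margins fixed.

    X follows a hypergeometric distribution over the top-left cell.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favourable = sum(
        comb(row1, k) * comb(total - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(total, col1)


if __name__ == "__main__":
    small = (8, 2, 2, 8)       # hypothetical counts
    large = (80, 20, 20, 80)   # same proportions, ten times the regions

    # Same odds ratio, very different p-values.
    print(odds_ratio(*small), fisher_right_tail(*small))
    print(odds_ratio(*large), fisher_right_tail(*large))

    # One extra count in a small cell halves the odds ratio.
    print(odds_ratio(3, 1, 1, 3), odds_ratio(3, 2, 1, 3))
```

Scaling every cell by ten leaves the odds ratio unchanged while the p-value shrinks by many orders of magnitude, and a single count added to a small cell changes the odds ratio sharply. Neither number states directly how similar the two sets are.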

Introduction

Functional genomic data are often summarized as interval sets and deposited in public repositories (e.g., UCSC, ENCODE, Roadmap, GEO, SRA). To compare a query interval set with multiple interval sets in a genomic sequence database, the search tools LOLA (Sheffield & Bock, 2016) and GIGGLE (Layer et al., 2018) calculate two values (Fisher’s exact p-value and the odds ratio based on the total number of intersections) and use them as the similarity score to rank the search results. These similarity metrics have proven useful for determining relationships among interval sets, but they have flaws. To overcome the weaknesses of the Fisher’s exact-based metrics, we developed Seqpare, a self-consistent metric for quantifying the similarity among genomic interval sets.
