Abstract
Accurate modeling of DNA sequences requires capturing distant semantic relationships between nucleotide bases. Most existing deep neural network models face two challenges: (1) they are limited to short DNA fragments and cannot capture long-range interactions, and (2) they require many supervised labels, which are often expensive to obtain in practice. We propose a new neural network model called SwanDNA to address both challenges. By using a sparse and wide network architecture, our model enables inference over very long DNA sequences. By incorporating the neural network into a self-supervised learning framework, our method can give accurate predictions while using fewer supervised labels. We evaluate SwanDNA on three DNA sequence inference tasks: human variant effect prediction, open chromatin region detection in plant genes, and GenomicBenchmarks. SwanDNA outperforms all competitors in the first two tasks and achieves state-of-the-art performance on seven of the eight datasets in GenomicBenchmarks. Our code is available at https://github.com/wiedersehne/SwanDNA.