Highly efficient clustering of long-read transcriptomic data with GeLuster.

Junchi Ma, Guojun Li, Ting Yu, Xiaoyu Zhao, Enfeng Qi, Renmin Han

doi:10.1093/bioinformatics/btae059

Abstract

Advances in long-read RNA sequencing technologies promise a bright future for transcriptome analysis, in which clustering long reads according to their gene family of origin is of great importance. However, existing de novo clustering algorithms require substantial computing resources. We developed a new algorithm, GeLuster, for clustering long RNA-seq reads. In our tests on one simulated dataset and nine real datasets, GeLuster exhibited superior performance. On the tested Nanopore datasets, it ran 2.9 to 17.5 times as fast as the second-fastest method with less than one-seventh of the memory consumption, while achieving higher clustering accuracy. GeLuster performed similarly on the PacBio data. It sets the stage for large-scale transcriptome studies in the future. GeLuster is freely available at https://github.com/yutingsdu/GeLuster.

Full Text