Abstract

Next-Generation Sequencing (NGS) platforms produce massive amounts of data for analyzing various features of environmental samples. These data contain many duplicate reads, which reduce the efficiency and accuracy of downstream analysis. We describe Fast-HBR, a fast and memory-efficient tool that removes duplicate reads de novo, without a reference genome. It uses hash tables and represents reads as integer values, which reduces memory usage and speeds up manipulation. Compared with state-of-the-art de novo duplicate removal tools, Fast-HBR is faster and has a smaller memory footprint. Fast-HBR is implemented in Python 3 and is available at https://github.com/Sami-Altayyar/Fast-HBR.
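
The idea of representing reads as integers in a hash-based structure can be illustrated with a minimal sketch. This is not Fast-HBR's actual implementation: the helper names `read_key` and `remove_exact_duplicates`, the choice of a 64-bit BLAKE2b digest, and the restriction to exact, case-insensitive duplicates are all assumptions made for illustration.

```python
import hashlib


def read_key(seq: str) -> int:
    """Map a read sequence to a 64-bit integer (hypothetical helper)."""
    digest = hashlib.blake2b(seq.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def remove_exact_duplicates(reads):
    """Yield (name, sequence) pairs whose sequence was not seen before.

    Only the integer keys are kept in memory, not the sequences themselves.
    Assumption: two reads are duplicates if their sequences are identical
    after upper-casing. A hash collision could discard a distinct read;
    with 64-bit keys this is unlikely but not impossible.
    """
    seen = set()
    for name, seq in reads:
        key = read_key(seq.upper())
        if key not in seen:
            seen.add(key)
            yield name, seq


# Toy input: r3 and r4 repeat the sequence of r1.
reads = [
    ("r1", "ACGTACGT"),
    ("r2", "TTGACCA"),
    ("r3", "ACGTACGT"),
    ("r4", "acgtacgt"),
]
unique = list(remove_exact_duplicates(reads))
print([name for name, _ in unique])  # ['r1', 'r2']
```

Storing one fixed-size integer per distinct read, rather than the full sequence string, is what keeps the memory footprint small in this sketch; the trade-off is a small collision risk that a production tool would need to address.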

Highlights

  • The number of publicly available Next-Generation Sequencing (NGS) projects tripled from 1200 in 2017 to 3500 in 2020 [1,2]

  • Results obtained using Fast-HBR are tabulated in Table 2, Table 3, and Table 4

  • We ran the tools on King Abdulaziz University's High Performance Computing Center (Aziz Supercomputer), where all tools ran on normal nodes, which are equipped with 24 processors and 96 GB of memory

Introduction

The number of publicly available NGS projects tripled from 1200 in 2017 to 3500 in 2020 [1,2]. Data preprocessing is essential for reducing the size of the data while maintaining an adequate level of data quality [3]. One preprocessing step that reduces dataset size is the removal of duplicate reads. This step is essential for sequence-based algorithms, since duplicate reads affect their accuracy [4]. Duplicate read removal tools are either reference-based or de novo. Available de novo tools include NGS Reads Treatment [9], Nubeam-
