Abstract

Researchers in biotechnology have made major advances over the past century. They can now measure expression levels for thousands of genes under different conditions and over varying periods of time. Analyzing these measurements is essential for understanding gene expression patterns and for extracting information about gene functions and biological roles. This paper describes a novel approach for clustering large-scale next-generation sequencing (NGS) data. The approach also helps predict patterns and the likelihood of mutations by means of a semi-supervised clustering technique. It builds on the previously developed FuzzyFind Dictionary construction, which uses the Golay code for error correction. The method runs in linear time and requires a single pass through the input file.

Highlights

  • Researchers have generated overwhelming amounts of gene expression data

  • This paper presents a novel approach for clustering massive streams of next-generation sequencing (NGS) data by utilizing the Golay Code Clustering (GCC) algorithm

  • This paper presents a clustering technique that reverses the traditional error-correction scheme, using the perfect Golay code described in [5] and [6]
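
To illustrate why a perfect Golay code lends itself to single-pass clustering, the following is a minimal sketch and not the paper's GCC algorithm or its FuzzyFind Dictionary. It assumes each sequence has already been reduced to a 23-bit feature vector (how that reduction is done is not covered here), and it simply labels each vector with its nearest codeword of the binary (23, 12) Golay code. Because the code is perfect, every 23-bit vector lies within Hamming distance 3 of exactly one codeword, so the label is always unique. The names `syndrome`, `encode`, `nearest_codeword`, and `cluster_stream` are illustrative.

```python
from collections import defaultdict
from itertools import combinations

N = 23       # length of the binary Golay code
G = 0xC75    # generator polynomial x^11 + x^10 + x^6 + x^5 + x^4 + x^2 + 1


def syndrome(word: int) -> int:
    """Remainder of word(x) divided by G(x) over GF(2) (11 bits)."""
    for shift in range(N - 12, -1, -1):
        if word & (1 << (shift + 11)):
            word ^= G << shift
    return word


def encode(message: int) -> int:
    """Multiply a 12-bit message polynomial by G(x) to get a codeword."""
    codeword = 0
    for i in range(12):
        if (message >> i) & 1:
            codeword ^= G << i
    return codeword


# Map every syndrome to the unique error pattern of weight <= 3 producing it.
ERROR_FOR_SYNDROME = {}
for weight in range(4):
    for positions in combinations(range(N), weight):
        error = sum(1 << p for p in positions)
        ERROR_FOR_SYNDROME[syndrome(error)] = error


def nearest_codeword(word: int) -> int:
    """Return the single codeword within Hamming distance 3 of word."""
    return word ^ ERROR_FOR_SYNDROME[syndrome(word)]


def cluster_stream(words):
    """Assumed clustering rule: one pass, one table lookup per vector."""
    clusters = defaultdict(list)
    for word in words:
        clusters[nearest_codeword(word)].append(word)
    return clusters
```

Each vector costs one polynomial division and one dictionary lookup, so the work grows linearly with the number of vectors and the input is read only once. Vectors that differ from a common codeword in at most three bits end up in the same cluster.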


Summary

Introduction

Researchers have generated overwhelming amounts of gene expression data, and the pace of data generation keeps accelerating. Comprehending gene expression data is a fundamental step in understanding human ancestry, diseases, and their interaction with environmental conditions. It can lead to new medicines and treatments for disease. Although storage capacity has kept pace with the rate at which gene expression data is generated, the human brain cannot make sense of such volumes of raw data. Clustering next-generation sequences is a computationally demanding problem that arises when studying large-scale DNA and RNA data. Gene expression clustering has many advantages: it allows scientists and researchers to study data without examining each individual gene, it enables data visualization, and it helps scientists infer roles for unknown genes that fall in the same cluster as known ones, as well as reduce redundancy in NGS data.

