Abstract

With the development of the Human Genome Project, ever more biological sequence data are being generated, and the analysis and processing of these data have driven the development of bioinformatics. Sequence similarity analysis is the basis of bioinformatics: it lets us use information from known sequences to study the structure, function, and evolutionary relationships of new, uncharacterized sequences. This paper performs data compression and retrieval on a genome database based on the dbSNP information of DNA. Following the rule that three bases (a codon) determine one amino acid, the sequence is mapped to amino acid characters, and redundant information is removed using the dbSNP information. We propose, for the first time, a new compressed biological sequence structure that reflects the strong correlation between SNP positions in the genome and the SNPs present in each sample. Finally, we construct a complete approximate neighbor query system for biological sequences, which greatly reduces storage and computing overhead and improves query efficiency while preserving retrieval accuracy. The accuracy and scalability of the method are verified by experiments on a large gene database data set.
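The idea of removing redundancy by keeping only SNP information can be illustrated with a minimal sketch. It is not the compressed structure proposed in the paper: it assumes substitution-only variants, samples already aligned to a common reference of equal length, and SNP positions found by direct comparison rather than looked up in dbSNP. The names `compress`, `decompress`, and `variant_distance` are hypothetical.

```python
from typing import List, Tuple

Variant = Tuple[int, str]  # (0-based position, base observed in the sample)


def compress(reference: str, sample: str) -> List[Variant]:
    """Keep only the positions where the sample differs from the reference."""
    if len(reference) != len(sample):
        # Assumption of this sketch: substitutions only, no insertions or deletions.
        raise ValueError("sample and reference must have the same length")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def decompress(reference: str, variants: List[Variant]) -> str:
    """Rebuild the full sample by applying its variants to the reference."""
    bases = list(reference)
    for pos, base in variants:
        bases[pos] = base
    return "".join(bases)


def variant_distance(v1: List[Variant], v2: List[Variant]) -> int:
    """Count positions where two samples differ, using only their variant lists."""
    d1, d2 = dict(v1), dict(v2)
    # A position missing from one list means that sample carries the reference base.
    return sum(1 for pos in d1.keys() | d2.keys() if d1.get(pos) != d2.get(pos))


reference = "ACGTACGT"
sample_a = "ACGAACGT"  # differs from the reference at position 3
sample_b = "ACGAACTT"  # differs from the reference at positions 3 and 6

variants_a = compress(reference, sample_a)
variants_b = compress(reference, sample_b)
```

In this toy setting, each sample is stored as a short list of differences, and a similarity query between two samples touches only those lists rather than the full sequences.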
