New checksum functions for Biopython

Sebastian Bassi,Virginia Gonzalez

doi:10.1038/npre.2007.278.1

Abstract

Checksum algorithms are used in biological databases for integrity checking and identification. CRC64 is the only checksum algorithm already included in Biopython. This work proposes two new implementations of known algorithms (GCG Checksum and SEGUID). There is also an application based on SEGUID: looking for redundancy between two FASTA files of protein sequences, based only on sequence information, by comparing the SEGUIDs of both files. The code is shown in the manuscript and may be available at Biopython.org.

Highlights

gcg24.py

    def gcg(seq):
        from itertools import cycle, izip
        return sum(n*ord(c.upper()) for (n,c) in\
                   izip(cycle(range(1,58)),seq)) % 10000

gcg24.py is a module that the main program imports when the Python version is ≥2.4. It can't be included in the main program because this code can't be parsed under Python 2.3.
“We propose the use of a unique sequence identifier (SEGUID) that is derived from the primary sequence itself and generated by any user
SEGUIDs are resilient to changes in public and private databases as they remain constant throughout the lifetime of a given protein sequence
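
The gcg function quoted in the first highlight relies on itertools.izip, which exists only in Python 2. As a rough illustration, the same calculation might be written for Python 3 as below. This is a sketch that assumes the built-in zip is an acceptable substitute for izip; it is not the code shipped with Biopython.

```python
from itertools import cycle


def gcg(seq):
    """Position-weighted sum of character codes, modulo 10000."""
    # Weights run from 1 to 57 and then wrap around, as in the original one-liner.
    weights = cycle(range(1, 58))
    # zip stops at the end of seq, so the endless weight cycle is safe here.
    return sum(n * ord(c.upper()) for n, c in zip(weights, seq)) % 10000


if __name__ == "__main__":
    print(gcg("ACGT"))
```

Because each character is upper-cased before it is weighted, "acgt" and "ACGT" give the same checksum.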

Summary

Introduction

SEGUID and GCG Checksum: New checksum algorithms for Biopython

Anyone can later perform the same operation on the data, compare the result to the authentic checksum, and (assuming that the sums match) conclude that the message was probably not corrupted.”

Why use a checksum for biological sequences? There are different reasons:

Data integrity validation: To be sure you are dealing with the same sequence after extensive manipulation or retrieval from multiple sources.
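
To make the redundancy application from the abstract concrete, the following sketch compares two sets of FASTA records by SEGUID. It assumes the usual SEGUID definition (SHA-1 digest of the upper-cased sequence, base64-encoded, with the trailing "=" padding removed). The helpers read_fasta and shared_sequences are hypothetical names introduced for illustration and are not taken from the manuscript.

```python
import base64
import hashlib


def seguid(seq):
    """SEGUID sketch: base64 of the SHA-1 digest of the upper-cased sequence."""
    digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
    # A 20-byte digest encodes to 28 characters ending in one "=", dropped here.
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines (hypothetical helper)."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def shared_sequences(lines_a, lines_b):
    """Return (header_a, header_b) pairs whose sequences have the same SEGUID."""
    # If a sequence occurs more than once in A, the last header seen is kept.
    by_seguid = {seguid(seq): header for header, seq in read_fasta(lines_a)}
    pairs = []
    for header, seq in read_fasta(lines_b):
        key = seguid(seq)
        if key in by_seguid:
            pairs.append((by_seguid[key], header))
    return pairs


if __name__ == "__main__":
    file_a = [">p1", "MKV", "LAA", ">p2", "GGG"]
    file_b = [">x", "mkvlaa", ">y", "TTT"]
    print(shared_sequences(file_a, file_b))
```

With real files, the line lists could be replaced by open file objects. The same seguid call also serves the integrity use case: compute it before and after manipulation and compare the two strings.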

Results

Conclusion