Abstract

Background: An inverted repeat is a DNA sequence followed downstream by its reverse complement, potentially with a gap in the centre. Inverted repeats are found in both prokaryotic and eukaryotic genomes and they have been linked with countless possible functions. Many international consortia provide a comprehensive description of common genetic variation, making alternative sequence representations, such as IUPAC encoding, necessary for leveraging the full potential of such broad variation datasets.

Results: We present IUPACpal, an exact tool for efficient identification of inverted repeats in IUPAC-encoded DNA sequences allowing also for potential mismatches and gaps in the inverted repeats.

Conclusion: Within the parameters that were tested, our experimental results show that IUPACpal compares favourably to a similar application packaged with EMBOSS. We show that IUPACpal identifies many previously unidentified inverted repeats when compared with EMBOSS, and that this is also performed with orders of magnitude improved speed.

Highlights

  • An inverted repeat is a deoxyribonucleic acid (DNA) sequence followed downstream by its reverse complement, potentially with a gap in the centre

  • IUPACpal is run with the following terminal command:

  • Output is given in a format identical to that of the European Molecular Biology Open Software Suite (EMBOSS), in which all discovered inverted repeats (IRs) are identified by their index locations (1-based indexing) alongside their symbol representation

Summary

Background

Context

An inverted repeat (IR) is a single-stranded sequence of nucleotides with a subsequent downstream sequence consisting of its reverse complement [1]. This illustrates the most commonly used diagrammatic representation of IRs.

IUPAC matching schemes

The International Union of Pure and Applied Chemistry (IUPAC) encoding is an extended alphabet Σ+ of symbols [20], which provides a single-symbol representation for every one of the 15 possible nonempty subsets of the standard 4-symbol DNA alphabet Σ = {A, C, G, T}.

Algorithm

Our algorithm exhaustively identifies all IRs by examining each position within a sequence and determining every valid IR with its centre at that position which adheres to the given input parameters. This process first makes use of the kangaroo method to create a function with the ability to identify the longest matching prefix of any two substrings of a string [25, 26]. The algorithm maintains efficiency by calculating only the necessary mismatch locations needed for a given set of parameters, and no more.
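The IUPAC matching and the mismatch-tolerant, kangaroo-style extension described above can be illustrated with a minimal Python sketch. It is not the IUPACpal implementation. The names `pairs_with`, `match_run` and `longest_arm` are hypothetical, and the matching rule used here (two symbols pair if some base encoded by one is complementary to some base encoded by the other) is an assumption. The naive loop in `match_run` stands in for the fast longest-matching-prefix query that the kangaroo method relies on.

```python
# Standard IUPAC nucleotide codes: each symbol maps to the bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def pairs_with(x, y):
    """True if some base encoded by x is complementary to some base encoded by y.

    Assumed matching rule for this sketch, not taken from the tool itself.
    """
    return any(COMPLEMENT[base] in IUPAC[y] for base in IUPAC[x])


def match_run(seq, left, right):
    """Count consecutive complementary pairs moving outward from (left, right).

    Naive stand-in for a longest-matching-prefix query.
    """
    n = 0
    while (left - n >= 0 and right + n < len(seq)
           and pairs_with(seq[left - n], seq[right + n])):
        n += 1
    return n


def longest_arm(seq, left, right, max_mismatches):
    """Arm length of the IR whose innermost pair is (left, right).

    Kangaroo-style: take a maximal matching run, hop over one mismatching
    pair, and repeat, using at most max_mismatches hops. The returned arm
    may end on a mismatching pair.
    """
    length = 0
    for hop in range(max_mismatches + 1):
        length += match_run(seq, left - length, right + length)
        at_boundary = left - length < 0 or right + length >= len(seq)
        if at_boundary or hop == max_mismatches:
            break
        length += 1  # step over the mismatching pair
    return length
```

For a gap of `g` symbols in the centre, the innermost pair would be `(left, left + 1 + g)`. Because only `max_mismatches + 1` runs are computed per centre, the work per candidate is bounded by the mismatch parameter rather than by the arm length once `match_run` is replaced by a constant-time query.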

Results
Conclusions
