Approximate String Matching Algorithm Research Articles

Repetitive strings such as periods have been studied actively in fields as diverse as data compression, computer-assisted music analysis, and bioinformatics. In bioinformatics, periods are closely related to tandem repeats, which are repetitive patterns in DNA sequences. Because the repeated patterns are often similar but not identical, approximate string matching is needed to study tandem repeats, which motivates the study of approximate periods. In this paper, we propose a new definition of approximate periods of strings based on distance sum, which complements existing definitions of approximate periods. Given two strings $p$ ($|p| = m$) and $x$ ($|x| = n$), we present an algorithm that computes the minimum approximate period distance of $p$ to $x$ based on distance sum. The algorithm runs in $O(mn^2)$ time for the weighted edit distance, $O(mn)$ time for the edit distance, and $O(n)$ time for the Hamming distance.
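As a rough illustration of the Hamming-distance case only, the sketch below assumes that x is cut into consecutive blocks of length m, with a possibly shorter last block compared against a prefix of p. Under that reading the distance sum is simply the number of mismatches between x and p repeated. The abstract does not spell out the definition, so this reading and the function name `hamming_period_distance_sum` are assumptions, not the paper's algorithm.

```python
def hamming_period_distance_sum(p: str, x: str) -> int:
    """Count mismatches between x and p repeated end to end.

    Assumption (not stated in the abstract): under the Hamming distance
    every block of x has length len(p), except a possibly shorter last
    block that is compared with the matching prefix of p.
    """
    if not p:
        raise ValueError("p must be non-empty")
    m = len(p)
    # Position i of x is compared with position i mod m of p: one pass, O(n).
    return sum(1 for i, ch in enumerate(x) if ch != p[i % m])


if __name__ == "__main__":
    # "abcabdab" against "abc" repeated ("abcabcab"): one mismatch (d vs c).
    print(hamming_period_distance_sum("abc", "abcabdab"))
```

A single pass over x is enough here because the block boundaries are fixed by the length of p; the edit-distance variants mentioned in the abstract have no such fixed boundaries, which is consistent with their higher stated costs.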

Multiple sequence alignment is a fundamental problem in computational biology. Because of its notorious difficulty, aligning sequences within a constant band (c-diagonal) is a popular practice in bioinformatics with good practical results [D. Sankoff, J. Kruskal, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison–Wesley, 1983; W.R. Pearson, D. Lipman, Improved tools for biological sequence comparison, Proc. Natl. Acad. Sci. USA 85 (1988) 2444–2448; W.R. Pearson, Rapid and sensitive sequence comparison with FASTP and FASTA, Methods Enzymol. 183 (1990) 63–98; W.R. Pearson, Searching protein sequence libraries: Comparison of the sensitivity and selectivity of the Smith–Waterman and FASTA algorithms, Genomics 11 (1991) 635–650; S. Altschul, D. Lipman, Trees, stars, and multiple sequence alignment, SIAM J. Appl. Math. 49 (1989) 197–209; K. Chao, W.R. Pearson, W. Miller, Aligning two sequences within a specified diagonal band, CABIOS 8 (1992) 481–487; J.W. Fickett, Fast optimal alignment, Nucleic Acids Res. 12 (1984) 175–180; E. Ukkonen, Algorithms for approximate string matching, Inform. Control 64 (1985) 100–118; J.L. Spouge, Fast optimal alignment, CABIOS 7 (1991) 1–7]. However, the problem is still NP-hard for multiple sequences. In this paper, we present a theoretical study of this problem. In particular, for arbitrarily small ϵ > 0, we present polynomial-time algorithms that produce a multiple alignment (not necessarily c-diagonal) with cost at most 1 + ϵ times the cost of the optimal c-diagonal alignment, under standard models of both SP alignment and consensus (star) alignment. Our algorithms for consensus alignment allow very general score schemes. In order to prove our main results, we also present similar results for SP alignment and consensus alignment that allow only a constant number of insertion and deletion gaps (of arbitrary length) per sequence on average. These results are interesting in their own right, and they improve some results in [M. Li, B. Ma, L. Wang, Finding similar regions in many sequences, in: Proc. 31st ACM Symp. Theory of Computing, Atlanta, 1999, pp. 473–482].
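To make the band restriction concrete, here is a minimal sketch of the pairwise case: unit-cost edit distance computed only over cells (i, j) with |i - j| <= c. It illustrates the general banding idea for two sequences and is not the multiple-alignment approximation scheme described in the abstract; the function name `banded_edit_distance` and the unit costs are assumptions made for the example.

```python
from typing import Optional


def banded_edit_distance(a: str, b: str, c: int) -> Optional[int]:
    """Unit-cost edit distance restricted to the band |i - j| <= c.

    Returns None when no alignment fits inside the band, which happens
    exactly when the lengths differ by more than c.
    """
    n, m = len(a), len(b)
    if c < 0 or abs(n - m) > c:
        return None
    INF = float("inf")
    # Row 0: only columns 0..c lie inside the band.
    prev = [j if j <= c else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        # Full-width row kept for clarity; only band cells are filled in.
        cur = [INF] * (m + 1)
        for j in range(max(0, i - c), min(m, i + c) + 1):
            if j == 0:
                cur[j] = i  # delete the first i characters of a
                continue
            substitute = prev[j - 1] + (a[i - 1] != b[j - 1])
            delete = prev[j] + 1      # INF if (i - 1, j) is outside the band
            insert = cur[j - 1] + 1   # INF if (i, j - 1) is outside the band
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[m]


if __name__ == "__main__":
    print(banded_edit_distance("kitten", "sitting", 1))
```

The inner loop visits at most 2c + 1 cells per row, which is where the saving over the full dynamic-programming table comes from. The result equals the true edit distance whenever some optimal alignment stays inside the band, and is an upper bound on it otherwise.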

Articles published on Approximate String Matching Algorithm

Mapping biological entities using the longest approximately common prefix method.

Approximate Chinese String Matching Techniques Based on Pinyin Input Method

An algorithm to improve the performance of string matching

Fast algorithms for approximate circular string matching

Approximate Multiple Pattern String Matching using Bit Parallelism: A Review

Distance-Sum-Based Approximate Periods of Strings for DNA Sequence Analysis

A Consensus Algorithm for Approximate String Matching

A BIT-PARALLEL DYNAMIC PROGRAMMING ALGORITHM SUITABLE FOR DNA SEQUENCE ALIGNMENT

Fast Algorithms for Top-k Approximate String Matching

Approximate String Matching with Compressed Indexes

Identification of design motifs with pattern matching algorithms

Approximate Boyer-Moore String Matching for Small Alphabets

Improved algorithms for approximate string matching (extended abstract)

Efficient and effectiveness retrieval of information using some of the approximate string matching algorithms

Improving the bit-parallel NFA of Baeza-Yates and Navarro for approximate string matching

Near optimal multiple alignment within a band in polynomial time

A programmable array processor architecture for flexible approximate string matching algorithms

Introduction of New Associate Editors

A software system for gene sequence database construction based on fast approximate string matching
