Abstract

The longest common subsequence (LCS) problem is a prominent NP-hard optimization problem where, given an arbitrary set of input strings, the aim is to find a longest subsequence that is common to all input strings. This problem has a variety of applications in bioinformatics, molecular biology and file plagiarism checking, among others. All previous approaches from the literature are dedicated to solving LCS instances sampled from uniform or near-to-uniform probability distributions of letters in the input strings. In this paper, we introduce an approach that is able to effectively deal with more general cases, where the occurrence of letters in the input strings follows a non-uniform distribution such as a multinomial distribution. The proposed approach makes use of a time-restricted beam search, guided by a novel heuristic named Gmpsum. This heuristic combines two complementary scoring functions in the form of a convex combination. Furthermore, apart from the close-to-uniform benchmark sets from the related literature, we introduce three new benchmark sets that differ in terms of their statistical properties. One of these sets concerns a case study in the context of text analysis. We provide a comprehensive empirical evaluation in two distinct settings: (1) short-time executions with a fixed beam size in order to evaluate the guidance abilities of the compared search heuristics; and (2) long-time executions with fixed target durations in order to obtain high-quality solutions. In both settings, the newly proposed approach performs comparably to state-of-the-art techniques on close-to-uniform instances and outperforms state-of-the-art approaches on non-uniform instances.

Highlights

  • We introduce two new longest common subsequence (LCS) benchmark sets based on multinomial distributions, whose main property is that letters occur with different frequencies

  • According to the obtained results, the clear winner is BS-Gmpsum, that is, beam search (BS) guided by the geometric mean probability sum heuristic (Gmpsum), which obtains the best average solution quality for all six instance groups. This indicates that Gmpsum provides clearly better search guidance than the other three heuristic functions for this benchmark set

  • We considered the prominent longest common subsequence problem with an arbitrary set of input strings

Summary

Introduction

A heuristic guidance that approximates the expected length of an LCS on uniform random strings was proposed. This way, a new state-of-the-art BS variant that leads on most of the existing random and quasi-random benchmark instances from the literature was obtained. We are aware of just one benchmark set with different distributions (BB, see Section 4), where the input strings are constructed in such a way that they exhibit high similarity, but still the letters’ frequencies are similar. In practical applications, this assumption of a uniform or close-to-uniform distribution of letters need not hold. The letter ‘N’ is very frequent in German (9.78%), but not so common in English (6.749%) and Russian (6.8%). Motivated by this consideration, we develop a new BS-based algorithm exhibiting improved performance for instances with different string distributions. We introduce some commonly used notation before giving an overview of the remainder of this article.

Preliminaries
Overview
Theoretical Aspects of Different String Distributions
Multinomial Distribution—Special Case 1
Multinomial Distribution—Special Case 2
Multinomial Distribution—Special Case 3
The Case of Independent Random Strings
Beam Search for Multinomially Distributed LCS Instances
Beam Search Framework
State Graph for the LCS Problem
Novel Heuristic Guidance
A Time-Restricted BS
Experimental Results
Benchmark Sets
Considered Algorithms
Tuning of Parameter λ
Summary of the Results
New State-of-the-Art Results for Instances from the Literature
Results for Benchmark Sets Poly and Bacteria
Statistical Significance of the So-Far Reported Results
Textual Corpus Case Study
Conclusions and Future Work