Solving the Longest Common Subsequence Problem Concerning Non-Uniform Distributions of Letters in Input Strings

Bojan Nikolic, Milana Grbic, Günther Raidl, Marko Djukanovic, Aleksandar Kartelj, Christian Blum

doi:10.3390/math9131515

Abstract

The longest common subsequence (LCS) problem is a prominent NP-hard optimization problem where, given an arbitrary set of input strings, the aim is to find a longest subsequence that is common to all input strings. This problem has a variety of applications in bioinformatics, molecular biology, and file plagiarism checking, among others. All previous approaches from the literature are dedicated to solving LCS instances sampled from uniform or near-to-uniform probability distributions of letters in the input strings. In this paper, we introduce an approach that is able to deal effectively with more general cases, where the occurrence of letters in the input strings follows a non-uniform distribution such as a multinomial distribution. The proposed approach makes use of a time-restricted beam search, guided by a novel heuristic named GMPSUM. This heuristic combines two complementary scoring functions in the form of a convex combination. Furthermore, apart from the close-to-uniform benchmark sets from the related literature, we introduce three new benchmark sets that differ in terms of their statistical properties. One of these sets concerns a case study in the context of text analysis. We provide a comprehensive empirical evaluation in two distinct settings: (1) short-time executions with fixed beam size in order to evaluate the guidance abilities of the compared search heuristics; and (2) long-time executions with fixed target duration times in order to obtain high-quality solutions. In both settings, the newly proposed approach performs comparably to state-of-the-art techniques on close-to-uniform instances and outperforms state-of-the-art approaches on non-uniform instances.
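To make the idea of a beam search guided by a convex combination of two scoring functions concrete, here is a minimal Python sketch. It is not the authors' GMPSUM heuristic, whose two components are not given in this excerpt. The scoring functions `ub_letter_counts` and `min_suffix_length` are simple stand-ins, and the names `beam_search_lcs`, `beam_width`, and `lam` are assumptions chosen for illustration. The sketch uses a fixed beam width rather than a time limit.

```python
from typing import List, Sequence, Tuple

def feasible_extensions(strings: Sequence[str], pos: Tuple[int, ...], alphabet: Sequence[str]):
    """Return (letter, new position vector) for every letter that still occurs in all suffixes."""
    out = []
    for a in alphabet:
        nxt = []
        for s, p in zip(strings, pos):
            i = s.find(a, p)          # next occurrence of a at or after position p
            if i < 0:
                break                 # letter a is exhausted in one string
            nxt.append(i + 1)
        else:
            out.append((a, tuple(nxt)))
    return out

def ub_letter_counts(strings, pos, alphabet) -> float:
    """Stand-in score 1: per-letter upper bound on the remaining LCS length."""
    return float(sum(min(s.count(a, p) for s, p in zip(strings, pos)) for a in alphabet))

def min_suffix_length(strings, pos, alphabet) -> float:
    """Stand-in score 2: length of the shortest remaining suffix."""
    return float(min(len(s) - p for s, p in zip(strings, pos)))

def beam_search_lcs(strings: List[str], beam_width: int = 10, lam: float = 0.5) -> str:
    """Beam search over position vectors; lam in [0, 1] weights the two stand-in scores."""
    alphabet = sorted(set.intersection(*(set(s) for s in strings)))

    def score(pos):
        # Convex combination of the two (hypothetical) scoring functions.
        return (lam * ub_letter_counts(strings, pos, alphabet)
                + (1.0 - lam) * min_suffix_length(strings, pos, alphabet))

    beam = [("", tuple(0 for _ in strings))]
    best = ""
    while beam:
        candidates = {}
        for seq, pos in beam:
            for a, nxt in feasible_extensions(strings, pos, alphabet):
                # All partial solutions on one level have equal length,
                # so one representative per position vector suffices.
                candidates.setdefault(nxt, seq + a)
        if not candidates:
            break
        ranked = sorted(candidates.items(), key=lambda kv: score(kv[0]), reverse=True)
        beam = [(seq, pos) for pos, seq in ranked[:beam_width]]
        best = beam[0][0]
    return best

def is_subsequence(t: str, s: str) -> bool:
    """True if t can be obtained from s by deleting characters."""
    it = iter(s)
    return all(ch in it for ch in t)
```

With a beam wide enough to hold every state of a level, the search is exhaustive and returns an optimal solution. With a narrow beam, the result is still a common subsequence, but it may be shorter than the optimum.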

Highlights

We introduce two new longest common subsequence (LCS) benchmark sets based on multinomial distributions, whose main property is that letters occur with different frequencies.
According to the obtained results, the clear winner is BS-GMPSUM (beam search, BS, guided by the geometric mean probability sum heuristic, GMPSUM), which obtains the best average solution quality for all six instance groups. This indicates that GMPSUM provides clearly better search guidance than the other three heuristic functions for this benchmark set.
We considered the prominent longest common subsequence problem with an arbitrary set of input strings.
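The notion of input strings whose letters occur with different frequencies can be illustrated with a short sampling sketch. This is not the benchmark generator used in the paper. The function names and the example probabilities below are assumptions chosen for illustration only.

```python
import random
from collections import Counter
from typing import Dict

def sample_string(length: int, probs: Dict[str, float], rng: random.Random) -> str:
    """Draw each position independently from the given letter distribution."""
    letters = list(probs)
    weights = [probs[a] for a in letters]
    return "".join(rng.choices(letters, weights=weights, k=length))

def letter_frequencies(s: str) -> Dict[str, float]:
    """Empirical relative frequency of each letter in s."""
    counts = Counter(s)
    return {a: c / len(s) for a, c in counts.items()}

# Hypothetical skewed distribution over a four-letter alphabet.
skewed = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
rng = random.Random(0)  # fixed seed for reproducibility
instance = [sample_string(1000, skewed, rng) for _ in range(3)]
freqs = letter_frequencies(instance[0])
```

In such an instance the letter `A` dominates every string, so a heuristic calibrated for uniform letter frequencies may misjudge which extensions are promising.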

Summary

Introduction

A heuristic guidance that approximates the expected length of an LCS on uniform random strings was proposed. This way, a new state-of-the-art BS variant that leads on most of the existing random and quasi-random benchmark instances from the literature was obtained. We are aware of just one benchmark set with different distributions (BB, see Section 4), where the input strings are constructed so that they exhibit high similarity, yet the letter frequencies are still similar. In practical applications, this assumption of a uniform or close-to-uniform distribution of letters need not hold. The letter ‘N’ is very frequent in German (9.78%), but not so common in English (6.749%) and Russian (6.8%). Motivated by this consideration, we develop a new BS-based algorithm exhibiting improved performance on instances with different string distributions. We introduce some commonly used notation before giving an overview of the remainder of this article.

Preliminaries

Overview

Theoretical Aspects of Different String Distributions

Multinomial Distribution—Special Case 1

Multinomial Distribution—Special Case 2

Multinomial Distribution—Special Case 3

The Case of Independent Random Strings

Beam Search for Multinomially Distributed LCS Instances

Beam Search Framework

State Graph for the LCS Problem

Novel Heuristic Guidance

A Time-Restricted BS

Experimental Results

Benchmark Sets

Considered Algorithms

Tuning of Parameter λ

Summary of the Results

New State-of-the-Art Results for Instances from the Literature

Results for Benchmark Sets Poly and Bacteria

Statistical Significance of the So-Far Reported Results

Textual Corpus Case Study

Conclusions and Future Work


Journal: Mathematics	Publication Date: Jun 29, 2021
Citations: 5	License type: CC BY 4.0


