The threshold q-gram distance: a simple, efficient, and effective distance measure for genomic sequence comparison

Abstract

The q-gram distance between two strings $s, s'$, introduced by Ukkonen in 1992, is an alignment-free string similarity measure which can be computed in linear time, as opposed to the quadratic time necessary for alignment/edit distance. It is based on the $L_1$-distance, or Manhattan distance, between the multiplicity vectors of fixed-length substrings (so-called q-grams or k-mers), and has been successfully applied in diverse bioinformatics settings. In this paper, we introduce the threshold q-gram distance (TqD), a new distance measure which is similar to the q-gram distance but uses reduced information on the multiplicities of the q-grams. The new measure retains the linear-time computation of the q-gram distance but requires significantly less space. Storage space and accuracy of the measure can be controlled via a user-defined threshold t, which sets a limit on the maximum value of the integers in the multiplicity vectors. In particular, for $t=1$, the comparison is made only on the basis of the sets of uniquely occurring q-grams on the one hand, and of repeated q-grams on the other. We tested the new distance measure, using the benchmarking tool AFproject of Zielezinski et al. [Genome Biology, 2019], on several real-life data sets for phylogenetic reconstruction and compared the results with those of other k-mer based distance measures. Our experiments show that the new measure TqD compares well to other non-alignment-based measures regarding accuracy, while requiring substantially less memory than the classic q-gram distance.
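
To make the definition concrete, here is a minimal Python sketch of the idea described in the abstract: q-gram counts are collected for both strings, each count is capped, and the L1 distance between the capped count vectors is returned. The capping convention (this sketch simply uses min(count, t)) and all names are illustrative and not taken from the authors' implementation.

```python
from collections import Counter

def threshold_qgram_distance(s1: str, s2: str, q: int, t: int) -> int:
    """L1 (Manhattan) distance between the q-gram multiplicity vectors of
    s1 and s2, with every multiplicity capped at the threshold t."""
    def capped_counts(s: str) -> Counter:
        counts = Counter(s[i:i + q] for i in range(len(s) - q + 1))
        # Capping at t is what saves space: each entry only needs enough
        # bits to represent the values 0..t instead of a full counter.
        return Counter({g: min(c, t) for g, c in counts.items()})

    c1, c2 = capped_counts(s1), capped_counts(s2)
    return sum(abs(c1[g] - c2[g]) for g in set(c1) | set(c2))

# A threshold larger than any multiplicity recovers the classic q-gram
# distance; small thresholds trade accuracy for space.
print(threshold_qgram_distance("ACGTACGTAC", "ACGTTGCAAC", q=2, t=1))
print(threshold_qgram_distance("ACGTACGTAC", "ACGTTGCAAC", q=2, t=10**9))
```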

Similar Papers
  • Research Article
  • Cited by 144
  • 10.1109/tassp.1987.1165058
A weighted cepstral distance measure for speech recognition
  • Oct 1, 1987
  • IEEE Transactions on Acoustics, Speech, and Signal Processing
  • Y Tohkura

A weighted cepstral distance measure is proposed and is tested in a speaker-independent isolated word recognition system using standard DTW (dynamic time warping) techniques. The measure is a statistically weighted distance measure with weights equal to the inverse variance of the cepstral coefficients. The experimental results show that the weighted cepstral distance measure works substantially better than both the Euclidean cepstral distance and the log likelihood ratio distance measures across two different databases. The recognition error rate obtained using the weighted cepstral distance measure was about 1 percent for digit recognition. This result was less than one-fourth of that obtained using the simple Euclidean cepstral distance measure and about one-third of the results using the log likelihood ratio distance measure. The most significant performance characteristic of the weighted cepstral distance was that it tended to equalize the performance of the recognizer across different talkers.
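
As a rough illustration of the measure described above (not the authors' implementation), the sketch below computes an inverse-variance-weighted distance between two cepstral vectors, with per-coefficient variances estimated from training data; the use of squared differences is an assumption, since the exact form is not given in this summary.

```python
import numpy as np

def weighted_cepstral_distance(c1, c2, variances):
    """Distance between two cepstral vectors, weighting each coefficient
    by the inverse of its variance (squared differences assumed)."""
    c1, c2, variances = map(np.asarray, (c1, c2, variances))
    return float(np.sum((c1 - c2) ** 2 / variances))

# Hypothetical usage: frame-level distances like this would be accumulated
# along a DTW alignment path by the recognizer.
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 12))     # toy stand-in for training cepstra
per_coeff_var = train.var(axis=0)
print(weighted_cepstral_distance(train[0], train[1], per_coeff_var))
```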

  • Conference Article
  • Cited by 33
  • 10.1109/icassp.1986.1169214
A weighted cepstral distance measure for speech recognition
  • Apr 1, 1986
  • Y Tohkura

A weighted cepstral distance measure is proposed and is tested in a speaker-independent isolated word recognition system using standard DTW (Dynamic Time Warping) techniques. The measure is a statistically weighted distance measure with weights equal to the inverse variance of the cepstral coefficients. The experimental results show that the weighted cepstral distance measure works substantially better than both the Euclidean cepstral distance and the log likelihood ratio distance measures across two different databases, namely a 10-digit vocabulary and a 129-word airline vocabulary. The recognition accuracy obtained using the weighted cepstral distance measure was about 99% for digit recognition. This result was more than 3% higher than that obtained using the simple Euclidean cepstral distance measure and about 2% higher than the results using the log likelihood ratio distance measure. The most significant performance characteristic of the weighted cepstral distance was that it tended to equalize the performance of the recognizer across different talkers.

  • Conference Article
  • Cited by 40
  • 10.1109/cdc.2006.376759
Density Approximation Based on Dirac Mixtures with Regard to Nonlinear Estimation and Filtering
  • Jan 1, 2006
  • Oliver C Schrempf + 2 more

A deterministic procedure for the optimal approximation of arbitrary probability density functions by means of Dirac mixtures with equal weights is proposed. The optimality of this approximation is guaranteed by minimizing the distance of the approximation from the true density. For this purpose a distance measure is required, which is in general not well defined for Dirac mixtures. Hence, a key contribution is to compare the corresponding cumulative distribution functions. This paper concentrates on the simple and intuitive integral quadratic distance measure. For the special case of a Dirac mixture with equally weighted components, closed-form solutions for special types of densities like uniform and Gaussian densities are obtained. A closed-form solution of the given optimization problem is not possible in general. Hence, another key contribution is an efficient solution procedure for arbitrary true densities based on a homotopy continuation approach. In contrast to standard Monte Carlo techniques like particle filters that are based on random sampling, the proposed approach is deterministic and ensures an optimal approximation with respect to a given distance measure. In addition, the number of required components (particles) can easily be deduced by application of the proposed distance measure. The resulting approximations can be used as a basis for recursive nonlinear filtering mechanisms as an alternative to Monte Carlo methods.
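
A small numerical sketch of the underlying comparison (not the paper's closed-form solutions or homotopy continuation procedure): the step-shaped cumulative distribution function of an equally weighted Dirac mixture is compared against the true CDF under the integral quadratic distance, here evaluated with a plain Riemann sum; the grid, the Gaussian target, and the candidate component placements are illustrative.

```python
import numpy as np
from math import erf, sqrt

def gaussian_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def integral_quadratic_distance(locations, true_cdf, lo=-5.0, hi=5.0, n=2001):
    """Integral quadratic distance between the step CDF of an equally
    weighted Dirac mixture at `locations` and `true_cdf`, approximated
    numerically on a uniform grid."""
    locations = np.sort(np.asarray(locations, dtype=float))
    grid = np.linspace(lo, hi, n)
    # Step CDF of the mixture: fraction of Dirac components at or below x.
    mixture_cdf = np.searchsorted(locations, grid, side="right") / len(locations)
    true_vals = np.array([true_cdf(x) for x in grid])
    dx = (hi - lo) / (n - 1)
    return float(np.sum((mixture_cdf - true_vals) ** 2) * dx)

# Comparing two candidate 5-component placements for a standard Gaussian:
print(integral_quadratic_distance([-2.0, -1.0, 0.0, 1.0, 2.0], gaussian_cdf))
print(integral_quadratic_distance([-1.3, -0.5, 0.0, 0.5, 1.3], gaussian_cdf))
```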

  • Conference Article
  • Cited by 1
  • 10.5281/zenodo.39273
A new distance measure employing element-significance factors for robust image classification
  • Sep 8, 2005
  • Kunio Kawahara + 1 more

A new simple distance measure has been proposed in which each vector element is weighted in the distance calculation according to its importance as determined by taking its statistics into account. In order to reflect the characteristics of the class, the element-significance factors are calculated based on intraclass variances and mean values of vector elements and utilized in the distance measure. The proposed distance measure has been applied to the face detection system and the cephalometric landmarks identification system which we developed in other work. Improved performances in image classification have been demonstrated.

  • Research Article
  • 10.52783/jisem.v10i8s.989
Ridge Regressive Quadratic Multivalued Feature Matching Pursuit for Skill-based Employability Identification in Higher Education
  • Jan 10, 2025
  • Journal of Information Systems Engineering and Management
  • Bijithra N C

Skill-based employability identification involves evaluating students' skills and determining their suitability for a specific job or industry. Data mining techniques have been developed for predicting student employability based on certain skills. Skills identification is a crucial step for students in understanding employability. However, accurate and time-efficient prediction of student employability has become a pivotal focus for educational institutions. This paper introduces a novel approach using data mining techniques called Ridge Regressive Quadratic Multivalued Projection Matching Pursuit (RRQMPMP) to identify skill-based employability for students in higher education with better accuracy and minimum time consumption. The proposed RRQMPMP technique includes two major processes namely preprocessing and feature selection. First, the number of features and student data are collected from the dataset. Then the preprocessing steps are executed, including three processes namely missing data handling, duplicate data removal, and normalization to clean the input dataset. The Ridge regressive imputation method is employed to handle missing data in the dataset. Subsequently, duplicate and non-duplicate data points are distinguished from the dataset using a simple matching distance measure. Finally, quadratic mean feature scaling is developed for the normalization process. With the preprocessed dataset, the feature selection step is performed by applying a Russell-Rao index multivalued projection matching pursuit. Based on the Russell-Rao similarity index value, pertinent and impertinent features are identified. Finally, pertinent features are selected for skill-based student employability prediction to achieve higher accuracy and minimize time consumption as well as space complexity. An experimental evaluation is carried out with respect to accuracy, error rate, time complexity, and space complexity for different numbers of student data. The quantitatively analyzed results indicate that the performance of the proposed RRQMPMP technique increases the accuracy of skill-based student employability prediction with minimum time and space complexity compared to conventional methods.
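
The two attribute-based measures named above (the simple matching distance for duplicate detection and the Russell-Rao index for feature relevance) have standard textbook definitions; the sketch below illustrates those standard forms on binary attribute vectors, and the exact variants used in the RRQMPMP technique may differ.

```python
import numpy as np

def simple_matching_distance(a, b):
    """Fraction of attributes on which two records disagree;
    0 indicates identical records, i.e. likely duplicates."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(a != b))

def russell_rao_similarity(a, b):
    """Fraction of attributes that are present (1) in both binary vectors."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    return float(np.mean(a & b))

x = [1, 0, 1, 1, 0, 1]
y = [1, 0, 1, 0, 0, 1]
print(simple_matching_distance(x, y))   # 1 of 6 attributes differ
print(russell_rao_similarity(x, y))     # 3 of 6 attributes shared
```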

  • Conference Article
  • Cited by 19
  • 10.1109/iembs.2011.6091107
Real-time retrieval of similar videos with application to computer-aided retinal surgery
  • Aug 1, 2011
  • G Quellec + 5 more

This paper introduces ongoing research on computer-aided ophthalmic surgery. In particular, a novel Content-Based Video Retrieval (CBVR) system is presented. Its purpose is the following: given a video stream captured by a digital camera monitoring the surgery, the system should retrieve, in real-time, similar video subsequences in video archives. In order to retrieve semantically-relevant videos, most existing CBVR systems rely on temporally flexible distance measures such as Dynamic Time Warping. These distance measures are slow and therefore do not allow real-time retrieval. In the proposed system, temporal flexibility is introduced in the way video subsequences are characterized, which allows the use of simple and fast distance measures. As a consequence, real-time retrieval of similar video subsequences, among hundreds of thousands of examples, is now possible. Besides, the proposed system is adaptive: a fast training procedure is presented. The system has been successfully applied to automated recognition of retinal surgery steps on a 69-video dataset: areas under the Receiver Operating Characteristic curves range from A(z)=0.809 to A(z)=0.989.

  • Conference Article
  • Cited by 1
  • 10.1109/igarss.2006.61
A Kernel Change Detection Algorithm in Remote Sense Imagery
  • Jul 1, 2006
  • Guorui Ma + 3 more

This paper proposes a novel kernel change detection (KCD) algorithm. The input vectors from two images of different times are mapped into a potentially much higher-dimensional feature space via a nonlinear mapping, which will usually increase the linear margin between the change and no-change regions. Then a simple linear distance measure between two high-dimensional feature vectors is defined in feature space, which corresponds to a complicated nonlinear distance measure in input space. Furthermore, the dot products in the distance measure are expressed as combinations of kernel functions, so that the large number of dot products is processed in input space via the combined-kernel tactic, which avoids the computational load of an explicit mapping. Finally, this paper takes the soft-margin single-class support vector machine (SVM) to select the optimal hyperplane with maximum margin. Preliminary results show the kernel change detection algorithm (KCD) has excellent performance in accuracy.
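
The step of computing a distance in feature space purely from kernel evaluations rests on a standard identity, ||phi(x) - phi(y)||^2 = k(x,x) - 2 k(x,y) + k(y,y); the sketch below illustrates it with a Gaussian (RBF) kernel, which is an assumed choice, since the paper's combined kernel is not specified in this summary.

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def feature_space_distance(x, y, kernel=rbf_kernel):
    """Squared distance between phi(x) and phi(y) in the kernel-induced
    feature space, computed without ever forming phi explicitly:
    ||phi(x) - phi(y)||^2 = k(x,x) - 2*k(x,y) + k(y,y)."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)

# Toy pixel vectors from the two acquisition dates:
print(feature_space_distance([0.2, 0.4, 0.1], [0.25, 0.38, 0.12]))  # small change
print(feature_space_distance([0.2, 0.4, 0.1], [0.90, 0.10, 0.70]))  # large change
```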

  • Research Article
  • Cited by 22
  • 10.1109/jstars.2012.2234439
Fast Implementation of Maximum Simplex Volume-Based Endmember Extraction in Original Hyperspectral Data Space
  • Apr 1, 2013
  • IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
  • Liguo Wang + 3 more

Endmember extraction (EE) is a prerequisite task for spectral analysis of hyperspectral imagery. Among EE algorithms, maximum simplex volume-based ones, such as the simplex growing algorithm (SGA) and the N-FINDR algorithm, have been widely used for their fully automated and efficient performance. However, implementation of these algorithms needs dimension reduction of the original data, and the algorithms include innumerable volume calculations. This leads to a low speed and thus becomes a limitation to their applications. In this paper, a simple distance measure is presented, and then fast SGA and fast N-FINDR algorithms are constructed based on the proposed distance measure; they are free of dimension reduction and make use of the distance measure instead of volume evaluation to speed up the algorithms. The complexity of the proposed methods is compared with that of the original algorithms by theoretical analysis. Experiments show that the implementation of the two improved EE algorithms is much faster than that of the two original maximum simplex volume-based EE algorithms.

  • Conference Article
  • Cited by 4
  • 10.1145/3139958.3140017
A Uniform Representation for Trajectory Learning Tasks
  • Nov 7, 2017
  • Qingzhe Li + 3 more

Most trajectory data are collected with a constant sample rate (e.g. GPS data). However, the variance of velocities can be very large, which causes non-uniformity of the sample points in a trajectory dataset. That is, the trajectory dataset can be very sparse in some parts, which causes most existing distance measures to produce unexpected results. On the other hand, the dataset can be extremely dense in other parts, which results in unnecessarily high computational complexity. Due to this phenomenon, choosing an appropriate sample rate becomes a difficult challenge. In order to address the dilemma, we propose a Step-Invariant Trajectory (SIT) representation that can provide a dynamic sample rate to represent any trajectory in a uniform way. The translation takes only linear time. We also propose an effective and scalable distance measure for the SIT representation. We evaluate the effectiveness and efficiency of our representation along with its distance measure by performing multiple trajectory classification and clustering experiments. The results show that our distance measure on the SIT representation is much more accurate and robust than other representations and distance measures on sparse trajectory datasets. Our approach can also achieve competitive accuracy compared with state-of-the-art model-based trajectory representations on dense datasets. However, translating the data to our representation is, on average, two orders of magnitude faster than translating to other model-based representations. Furthermore, our representation can also serve as a preprocessing step to provide high-quality input to all trajectory learning methods.

  • Research Article
  • Cited by 4
  • 10.1093/bib/bbac032
Heterogeneous cryo-EM projection image classification using a two-stage spectral clustering based on novel distance measures.
  • Mar 7, 2022
  • Briefings in Bioinformatics
  • Xiangwen Wang + 2 more

Single-particle cryo-electron microscopy (cryo-EM) has become one of the mainstream technologies in the field of structural biology to determine the three-dimensional (3D) structures of biological macromolecules. Heterogeneous cryo-EM projection image classification is an effective way to discover conformational heterogeneity of biological macromolecules in different functional states. However, due to the low signal-to-noise ratio of the projection images, the classification of heterogeneous cryo-EM projection images is a very challenging task. In this paper, two novel distance measures between projection images integrating the reliability of common lines, pixel intensity and class averages are designed, and then a two-stage spectral clustering algorithm based on the two distance measures is proposed for heterogeneous cryo-EM projection image classification. In the first stage, the novel distance measure integrating common lines and pixel intensities of projection images is used to obtain preliminary classification results through spectral clustering. In the second stage, another novel distance measure integrating the first novel distance measure and class averages generated from each group of projection images is used to obtain the final classification results through spectral clustering. The proposed two-stage spectral clustering algorithm is applied on a simulated and a real cryo-EM dataset for heterogeneous reconstruction. Results show that the two novel distance measures can be used to improve the classification performance of spectral clustering, and using the proposed two-stage spectral clustering algorithm can achieve higher classification and reconstruction accuracy than using RELION and XMIPP.

  • Research Article
  • 10.1109/tcbbio.2025.3590588
Closing the Complexity Gap of the Double Distance Problem.
  • Jan 1, 2025
  • IEEE Transactions on Computational Biology and Bioinformatics
  • Luís Cunha + 5 more

Genome rearrangement has been an active area of research in computational comparative genomics for the last three decades. While initially mostly an interesting algorithmic endeavor, the practical application of rearrangement distance methods and more advanced phylogenetic tasks is now becoming common practice, given the availability of many completely sequenced genomes. Several genome rearrangement models have been developed over time, sometimes with surprising computational properties. A prominent example is the fact that computing the reversal distance of two signed permutations is possible in linear time, while for two unsigned permutations it is NP-hard. Therefore one always has to be careful about the precise problem formulation and complexity analysis of rearrangement problems in order not to be fooled. The double distance is the minimum number of genomic rearrangements between a singular and a duplicated genome that, in addition to rearrangements, are separated by a whole genome duplication. At the same time it allows assigning the genes of the duplicated genome to the two paralogous chromosome copies that existed right after the duplication event. Computing the double distance is another example of a tricky hardness landscape: if the distance measure underlying the double distance is the simple breakpoint distance, the problem can be solved in linear time, while with the more elaborate DCJ distance it is NP-hard. Indeed, there is a whole family of distance measures, parameterized by an even number $k$, between the breakpoint distance ($k=2$) at one end and the DCJ distance ($k=\infty$) at the other end. Little was known about the hardness border that lies somewhere between these two extremes; apart from the two border cases, the (linear) problem complexity was known only for $k=4$ and $k=6$. In this paper we close the gap, giving a full picture of the hardness landscape when computing the double distance.
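
For orientation only, the "simple breakpoint distance" referred to above can be illustrated in its most reduced form, on two unichromosomal gene orders without duplicate genes: count the adjacencies of one order that are not preserved in the other. This simplified sketch ignores telomere conventions and does not cover the duplicated-genome setting of the double distance itself; names and the example are illustrative.

```python
def breakpoint_distance(genome_a, genome_b):
    """Breakpoint distance between two unsigned gene orders over the same
    gene set (single linear chromosome, no duplicate genes): the number of
    adjacent gene pairs in genome_a that are not adjacent in genome_b."""
    def adjacencies(order):
        # Unordered adjacent pairs, so the orientation of a pair is ignored.
        return {frozenset(pair) for pair in zip(order, order[1:])}
    return len(adjacencies(genome_a) - adjacencies(genome_b))

# (1,2) and (3,4) are adjacencies of the first order lost in the second:
print(breakpoint_distance([1, 2, 3, 4, 5], [1, 3, 2, 4, 5]))  # -> 2
```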

  • Conference Article
  • Cited by 339
  • 10.1109/cvpr.2001.990935
Event-based analysis of video
  • Dec 1, 2001
  • L Zelnik-Manor + 1 more

Dynamic events can be regarded as long-term temporal objects, which are characterized by spatio-temporal features at multiple temporal scales. Based on this, we design a simple statistical distance measure between video sequences (possibly of different lengths) based on their behavioral content. This measure is non-parametric and can thus handle a wide range of dynamic events. We use this measure for isolating and clustering events within long continuous video sequences. This is done without prior knowledge of the types of events, their models, or their temporal extent. An outcome of such a clustering process is a temporal segmentation of long video sequences into event-consistent sub-sequences, and their grouping into event-consistent clusters. Our event representation and associated distance measure can also be used for event-based indexing into long video sequences, even when only one short example-clip is available. However, when multiple example-clips of the same event are available (either as a result of the clustering process, or given manually), these can be used to refine the event representation, the associated distance measure, and accordingly the quality of the detection and clustering process.

  • Conference Article
  • Cited by 4
  • 10.1117/12.274165
Wavelet-based feature extraction for mammographic lesion recognition
  • Apr 25, 1997
  • Lori M Bruce + 1 more

In this paper, multiresolution analysis, specifically the discrete wavelet transform modulus-maxima method, is utilized for the extraction of mammographic lesion shape features. These shape features are used in a classification system to classify lesions as cysts, fibroadenomas, or carcinomas. The multiresolution shape features are compared with traditional uniresolution shape features for their class-discriminating abilities. The study involves 60 digitized mammographic images. The lesions are segmented prior to introduction to the classification system. The uniresolution and multiresolution shape features are calculated using the radial distance measure of the lesion boundaries. The discriminating power of the shape features is analyzed via linear discriminant analysis. The classification system utilizes a simple Euclidean distance measure to determine class membership. The system is tested using the apparent and leave-one-out test methods. When using the combined multiresolution and uniresolution shape features, the system achieves classification rates of 83% and 80% for the apparent and leave-one-out test methods, respectively. When using only the uniresolution shape features, the classification rates are 72% and 68% for the apparent and leave-one-out test methods, respectively. Keywords: wavelet transform, multiresolution analysis, feature extraction, classification, shape, mammography, image processing.

  • Research Article
  • Cited by 33
  • 10.1121/1.1513647
A narrow band pattern-matching model of vowel perception.
  • Jan 28, 2003
  • The Journal of the Acoustical Society of America
  • James M Hillenbrand + 1 more

The purpose of this paper is to propose and evaluate a new model of vowel perception which assumes that vowel identity is recognized by a template-matching process involving the comparison of narrow band input spectra with a set of smoothed spectral-shape templates that are learned through ordinary exposure to speech. In the present simulation of this process, the input spectra are computed over a sufficiently long window to resolve individual harmonics of voiced speech. Prior to template creation and pattern matching, the narrow band spectra are amplitude equalized by a spectrum-level normalization process, and the information-bearing spectral peaks are enhanced by a "flooring" procedure that zeroes out spectral values below a threshold function consisting of a center-weighted running average of spectral amplitudes. Templates for each vowel category are created simply by averaging the narrow band spectra of like vowels spoken by a panel of talkers. In the present implementation, separate templates are used for men, women, and children. The pattern matching is implemented with a simple city-block distance measure given by the sum of the channel-by-channel differences between the narrow band input spectrum (level-equalized and floored) and each vowel template. Spectral movement is taken into account by computing the distance measure at several points throughout the course of the vowel. The input spectrum is assigned to the vowel template that results in the smallest difference accumulated over the sequence of spectral slices. The model was evaluated using a large database consisting of 12 vowels in /hVd/ context spoken by 45 men, 48 women, and 46 children. The narrow band model classified vowels in this database with a degree of accuracy (91.4%) approaching that of human listeners.
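
The matching stage described above amounts to nearest-template classification under a city-block (L1) distance accumulated over spectral slices; a minimal sketch, assuming the input spectra are already level-equalized and floored as described, and that each vowel is represented by a single static template (names and toy numbers are illustrative):

```python
import numpy as np

def city_block_distance(spectrum, template):
    """Sum of channel-by-channel absolute differences."""
    return float(np.sum(np.abs(np.asarray(spectrum) - np.asarray(template))))

def classify_vowel(input_slices, templates):
    """Assign the input to the vowel whose template gives the smallest
    distance accumulated over the sequence of spectral slices."""
    scores = {
        vowel: sum(city_block_distance(s, tmpl) for s in input_slices)
        for vowel, tmpl in templates.items()
    }
    return min(scores, key=scores.get)

# Two hypothetical 3-channel templates and two spectral slices of an input:
templates = {"iy": np.array([0.9, 0.1, 0.0]), "aa": np.array([0.2, 0.8, 0.4])}
slices = [np.array([0.80, 0.20, 0.10]), np.array([0.85, 0.15, 0.05])]
print(classify_vowel(slices, templates))  # -> "iy"
```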

  • Research Article
  • Cited by 110
  • 10.1109/tpami.2006.194
Statistical analysis of dynamic actions
  • Sep 1, 2006
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • L Zelnik-Manor + 1 more

Real-world action recognition applications require the development of systems which are fast, can handle a large variety of actions without a priori knowledge of the types of actions, need a minimal number of parameters, and require as short a learning stage as possible. In this paper, we suggest such an approach. We regard dynamic activities as long-term temporal objects, which are characterized by spatio-temporal features at multiple temporal scales. Based on this, we design a simple statistical distance measure between video sequences which captures the similarities in their behavioral content. This measure is nonparametric and can thus handle a wide range of complex dynamic actions. Having a behavior-based distance measure between sequences, we use it for a variety of tasks, including video indexing, temporal segmentation, and action-based video clustering. These tasks are performed without prior knowledge of the types of actions, their models, or their temporal extents.
