Abstract

BackgroundThe binary similarity and dissimilarity measures have critical roles in the processing of data consisting of binary vectors in various fields including bioinformatics and chemometrics. These metrics express the similarity and dissimilarity values between two binary vectors in terms of the positive matches, absence mismatches or negative matches. To our knowledge, there is no published work presenting a systematic way of finding an appropriate equation to measure binary similarity that performs well for certain data type or application. A proper method to select a suitable binary similarity or dissimilarity measure is needed to obtain better classification results.ResultsIn this study, we proposed a novel approach to select binary similarity and dissimilarity measures. We collected 79 binary similarity and dissimilarity equations by extensive literature search and implemented those equations as an R package called bmeasures. We applied these metrics to quantify the similarity and dissimilarity between herbal medicine formulas belonging to the Indonesian Jamu and Japanese Kampo separately. We assessed the capability of binary equations to classify herbal medicine pairs into match and mismatch efficacies based on their similarity or dissimilarity coefficients using the Receiver Operating Characteristic (ROC) curve analysis. According to the area under the ROC curve results, we found Indonesian Jamu and Japanese Kampo datasets obtained different ranking of binary similarity and dissimilarity measures. Out of all the equations, the Forbes-2 similarity and the Variant of Correlation similarity measures are recommended for studying the relationship between Jamu formulas and Kampo formulas, respectively.ConclusionsThe selection of binary similarity and dissimilarity measures for multivariate analysis is data dependent. The proposed method can be used to find the most suitable binary similarity and dissimilarity equation wisely for a particular data. Our finding suggests that all four types of matching quantities in the Operational Taxonomic Unit (OTU) table are important to calculate the similarity and dissimilarity coefficients between herbal medicine formulas. Also, the binary similarity and dissimilarity measures that include the negative match quantity d achieve better capability to separate herbal medicine pairs compared to equations that exclude d.Electronic supplementary materialThe online version of this article (doi:10.1186/s12859-016-1392-z) contains supplementary material, which is available to authorized users.

Highlights

  • The binary similarity and dissimilarity measures have critical roles in the processing of data consisting of binary vectors in various fields including bioinformatics and chemometrics

  • We propose a method to select the most suitable similarity measures in the context of classification based on False Positive Rates (FPRs) and True Positive Rates (TPRs) by using Receiver Operating Characteristic (ROC) curve analysis

  • Determining appropriate binary similarity and dissimilarity coefficients is an essential aspect of big data analysis in versatile areas of scientific research including chemometrics and bioinformatics

Read more

Summary

Introduction

The binary similarity and dissimilarity measures have critical roles in the processing of data consisting of binary vectors in various fields including bioinformatics and chemometrics. Binary features have been commonly used to represent a great variety of data [1,2,3], expressing the binary status of samples as presence/absence, yes/no, or true/false It has many applications in the bioinformatics, chemometrics, and medical fields [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19], as well as in pattern recognition, information retrieval, statistical analysis, and data mining [20, 21]. Todeschini et al [24] reported an analysis of 44 different similarity coefficients for computing the similarities between binary fingerprints by using simple descriptive statistics, correlation analysis, multidimensional scaling Hasse diagrams, and their proposed method ‘atemporal target diffusion model’

Objectives
Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.