Kinds Of Similarity Measures Research Articles

Existing bug triage approaches for developer recommendation systems are mainly based on machine learning (ML) techniques. These approaches have shown low prediction accuracy and high bug tossing length (BTL). The objective of this paper is to develop a robust algorithm for reducing BTL based on the concept of a developer expertise score (DES). To the best of our knowledge, none of the existing approaches has used metrics to build a developer expertise score. The DES strategy consists of two stages. Stage I is an offline process that identifies developers based on DES, computing the score from the priority, versatility and average fix time of each developer's individual contributions. The online process then finds capable developers using three kinds of similarity measures (feature-based, cosine similarity and Jaccard). Stage II of the online process simply ranks the developers. Hit ratio and reassignment accuracy were used for performance evaluation. We compared our system against ML-based bug triaging approaches using three types of classifiers: Naive Bayes, Support Vector Machine and C4.5. Using five open source databases (Mozilla, Eclipse, Netbeans, Firefox and Freedesktop) covering 41,622 bug reports, our DES system yielded a mean accuracy, precision, recall and F-score of 89.49%, 89.53%, 89.42% and 89.49%, respectively, and reduced BTL by up to 88.55%. This demonstrates an improvement of up to 20% over existing strategies. This work presented a novel developer recommendation algorithm that ranks developers using a metric-based integrated score for bug triaging. This integrated score was based on the developer's expertise, with the objectives of (i) improving bug assignment and (ii) reducing the bug tossing length. Such an architecture has applications in software bug triaging frameworks.
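
To make two of the similarity measures named in the abstract concrete, the following is a minimal sketch of cosine and Jaccard similarity applied to ranking developers against a bug report. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how bug reports or developer profiles are represented, so the token lists, the function names (`cosine_similarity`, `jaccard_similarity`, `rank_developers`) and the example developers are all hypothetical. The feature-based measure and the DES formula itself are not sketched because the abstract does not define them.

```python
import math
from collections import Counter


def cosine_similarity(a_tokens, b_tokens):
    """Cosine similarity between two token lists, using term counts as vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty token list has no direction
    return dot / (norm_a * norm_b)


def jaccard_similarity(a_tokens, b_tokens):
    """Jaccard similarity: shared distinct tokens over all distinct tokens."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_developers(bug_tokens, developer_profiles, similarity=cosine_similarity):
    """Return (developer, score) pairs sorted from most to least similar.

    developer_profiles is an assumed mapping from developer name to the
    tokens of bug reports that developer has previously fixed.
    """
    scored = [
        (dev, similarity(bug_tokens, tokens))
        for dev, tokens in developer_profiles.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical data, for illustration only.
bug = ["crash", "render", "css"]
profiles = {
    "alice": ["css", "render", "layout"],
    "bob": ["network", "socket"],
}
ranking = rank_developers(bug, profiles)
```

Passing `similarity=jaccard_similarity` to `rank_developers` swaps the measure without changing the ranking step.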

Background: The ability to efficiently search and filter datasets depends on access to high quality metadata. While most biomedical repositories require data submitters to provide a minimal set of metadata, some, such as the Gene Expression Omnibus (GEO), allow users to specify additional metadata in the form of textual key-value pairs (e.g. sex: female). However, since there is no structured vocabulary to guide the submitter regarding the metadata terms to use, the 44,000,000+ key-value pairs in GEO suffer from numerous quality issues including redundancy, heterogeneity, inconsistency and incompleteness. Such issues hinder the ability of scientists to home in on datasets that meet their requirements and point to a need for accurate, structured and complete description of the data.

Methods: In this study, we propose a clustering-based approach to address data quality issues in biomedical, specifically gene expression, metadata. First, we present three different kinds of similarity measures to compare metadata keys. Second, we design a scalable agglomerative clustering algorithm to cluster similar keys together.

Results: Our agglomerative clustering algorithm identified metadata keys that were similar to each other, based on (i) name, (ii) core concept and (iii) value similarities, and grouped them together. We evaluated our method using a manually created gold standard in which 359 keys were grouped into 27 clusters based on six types of characteristics: (i) age, (ii) cell line, (iii) disease, (iv) strain, (v) tissue and (vi) treatment. The algorithm generated 18 clusters containing 355 keys (four clusters with only one key were excluded). Within these 18 clusters, keys were correctly identified as related to their cluster, except for 13 keys that were not related to the cluster in which they were placed. We compared our approach with four other published methods. Our approach significantly outperformed them for most metadata keys and achieved the best average F-score (0.63).

Conclusion: Our algorithm identified keys that were similar to each other and grouped them together. The intuition that underpins cleaning by clustering is that dividing keys into different clusters resolves the scalability issues for data observation and cleaning, and that duplicates and errors among keys in the same cluster can easily be found. Our algorithm can also be applied to other biomedical data types.
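
The clustering step described in the Methods can be illustrated with a small sketch. This is a deliberately naive version under stated assumptions, not the authors' scalable algorithm: it uses only a name-based similarity (the standard library's `difflib.SequenceMatcher` ratio stands in for the paper's name similarity, which the abstract does not define), average linkage, and a hypothetical merge threshold of 0.6. The core-concept and value similarities are omitted, and the function names and example keys are invented for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (assumed stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_similarity(c1, c2):
    """Average linkage: mean pairwise similarity between two clusters of keys."""
    total = sum(name_similarity(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerative_cluster(keys, threshold=0.6):
    """Repeatedly merge the two most similar clusters until none reach threshold.

    Naive version: every round compares all cluster pairs, so it is only
    suitable for small inputs.
    """
    clusters = [[k] for k in keys]  # start with one cluster per key
    while len(clusters) > 1:
        best_pair, best_sim = None, threshold
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cluster_similarity(clusters[i], clusters[j])
                if sim >= best_sim:
                    best_pair, best_sim = (i, j), sim
        if best_pair is None:
            break  # no remaining pair is similar enough to merge
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical metadata keys, for illustration only.
keys = ["age", "Age", "age (years)", "tissue", "tissue type", "strain"]
clusters = agglomerative_cluster(keys)
```

With a purely string-based measure, a key such as "age (years)" can stay separate from "age" because the extra characters lower the ratio, which suggests why combining name similarity with concept and value similarities, as the abstract describes, may be useful.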

Articles published on Kinds Of Similarity Measures

Locating the propagation source in complex networks with observers-based similarity measures and direction-induced search

Matching heterogeneous ontologies with adaptive evolutionary algorithm

Classification for PolSAR image based on Hölder divergences

Ranking of software developers based on expertise score for bug triaging

Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata

Discriminating the reaction types of plant type III polyketide synthases.

Exploring information from the topology beneath the Gene Ontology terms to improve semantic similarity measures

A new clustering algorithm based on near neighbor influence

Some new approaches to constructing similarity measures

A Terminological Search Algorithm for Ontology Matching

Context-Aware Service Selection with Uncertain Context Information

Appropriate similarity measures for author co‐citation analysis

Shape-based time series similarity measure and pattern discovery algorithm
