String Similarity Research Articles

Thesis/aim of the article: The article presents a method for deduplicating/linking records that describe bibliographic items in databases, based on string similarity measures. The algorithm was developed from the authors' own experience gained while building a bibliographic database and while conducting bibliometric research on publicly available bibliographic databases. The formal description of the method is illustrated with examples drawn from the national bibliographic database CYTBIN. Research methods: Developing the method required a review of the information architectures of selected national bibliographic databases; a typology of the problems affecting them, which stem not only from the adopted data storage models but also from the design of the graphical user interfaces through which they are populated; an analysis and selection of string similarity measures; and, finally, a proposed composite measure that evaluates the similarity of bibliographic records based on the values of their constituent attributes. Results: The results, presented using data from a selected bibliographic database, made it possible to verify empirically the usefulness of the proposed method. In addition, the distribution of similarity among bibliographic records in the CYTBIN database was analyzed, as determined by the proposed composite method and by a method based on the Jaro-Winkler measure computed for the titles of bibliographic items. Conclusions: Once its parameters are tuned to the specifics (the anomalies present) of a particular bibliographic database, the proposed method can be applied directly to improve the quality of the bibliographic descriptions stored there, both in a proactive mode of operation (before an operator approves a description) and in a reactive mode (verification of all or newly collected records, performed, for example, at daily intervals during periods of lower system load).
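The abstract above names two ingredients: the Jaro-Winkler measure applied to titles, and a composite measure built from the attributes of a record. The abstract does not give the composite formula, so the following is only a minimal sketch of one plausible form, a weighted mean of per-field Jaro-Winkler scores. The field names, the weights in `FIELD_WEIGHTS`, and the sample records are all hypothetical and are not taken from the article or from CYTBIN.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and no further apart than this.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of up to max_prefix characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(text.lower().split())


# Hypothetical weights: an assumption for illustration, not the article's parameters.
FIELD_WEIGHTS = {"title": 0.6, "authors": 0.3, "year": 0.1}


def record_similarity(rec_a: dict, rec_b: dict, weights: dict = FIELD_WEIGHTS) -> float:
    """Weighted mean of per-field Jaro-Winkler scores; fields missing in either record are skipped."""
    score = 0.0
    used_weight = 0.0
    for field, weight in weights.items():
        a = normalize(rec_a.get(field, ""))
        b = normalize(rec_b.get(field, ""))
        if not a or not b:
            continue
        score += weight * jaro_winkler(a, b)
        used_weight += weight
    return score / used_weight if used_weight else 0.0


# Invented records, used only to exercise the functions.
rec_a = {"title": "An invented study of record linkage", "authors": "Author, A.", "year": "2019"}
rec_b = {"title": "An invented study of record-linkage ", "authors": "Author A", "year": "2019"}
rec_c = {"title": "Unrelated notes on graph colouring", "authors": "Writer, B.", "year": "2018"}

print(record_similarity(rec_a, rec_b))  # near-duplicate pair
print(record_similarity(rec_a, rec_c))  # unrelated pair
```

In a deduplication setting, a pair would be flagged when its score exceeds a threshold. Both the threshold and the weights would need tuning to the anomalies of the particular database, as the conclusions of the abstract indicate.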

Motivation: The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component in many NLP tasks, including text retrieval and summarization. A number of approaches have been proposed for semantic sentence similarity estimation for generic English. However, our experiments showed that such approaches do not effectively cover biomedical knowledge and produce poor results for biomedical text. Methods: We propose several approaches for sentence-level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on the distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus. In addition, ontology-based approaches are presented that utilize general and domain-specific ontologies. Finally, a supervised regression-based model is developed that effectively combines the different similarity computation metrics. A benchmark data set consisting of 100 sentence pairs from the biomedical literature is manually annotated by five human experts and used for evaluating the proposed methods. Results: The experiments showed that the supervised semantic sentence similarity computation approach obtained the best performance (0.836 correlation with gold standard human annotations) and improved over the state-of-the-art domain-independent systems by up to 42.6% in terms of the Pearson correlation metric. Availability and implementation: A web-based system for biomedical semantic sentence similarity computation, the source code, and the annotated benchmark data set are available at: http://tabilab.cmpe.boun.edu.tr/BIOSSES/.
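The evaluation described in this abstract scores each similarity method by its Pearson correlation with human annotations. As a rough sketch of that setup, the code below computes one simple string similarity measure (token-level Jaccard) and correlates it with reference scores. It does not reproduce the BIOSSES system or its supervised regression model. The sentence pairs and the `human_scores` values are invented for illustration and do not come from the benchmark data set.

```python
import math
import re


def tokens(sentence: str) -> set:
    """Lowercased alphanumeric word tokens of a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity: shared tokens over all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented sentence pairs and invented reference scores (assumptions, not benchmark data).
pairs = [
    ("The protein binds to the receptor.", "The protein binds the receptor."),
    ("Gene expression increased in tumor cells.", "Tumor cells showed increased gene expression."),
    ("The drug reduced blood pressure.", "Samples were stored at low temperature."),
]
human_scores = [4.0, 3.5, 0.0]

predicted = [jaccard(a, b) for a, b in pairs]
print(predicted)
print(pearson(predicted, human_scores))
```

A purely lexical measure such as this one ignores synonyms and paraphrases, which is one reason the abstract combines string measures with vector-based and ontology-based ones.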

Articles published on String Similarity

Miary podobieństw łańcuchów znakowych a deduplikacja rekordów w bibliograficznych bazach danych [String similarity measures and record deduplication in bibliographic databases]

De indeling van de dialecten in Noord-Limburg en het aangrenzende Duitse gebied [The classification of the dialects in North Limburg and the adjacent German area]

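Estudo para integração entre a Plataforma Lattes a Biblioteca Digital Brasileira de Teses e Dissertações (BDTD) e o Banco de Teses e Dissertações da Capes [Study on the integration of the Lattes Platform, the Brazilian Digital Library of Theses and Dissertations (BDTD), and the Capes Theses and Dissertations Database]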

Toponym matching through deep neural networks

FrepJoin: an efficient partition-based algorithm for edit similarity join

Learning to combine multiple string similarity metrics for effective toponym matching

Crowd-Guided Entity Matching with Consolidated Textual Data

LS-Join: Local Similarity Join on String Collections

A Novel Technique for Detecting Plagiarism in Documents Exploiting Information Sources

Learning abstract visual concepts via probabilistic program induction in a Language of Thought

BIOSSES: a semantic sentence similarity estimation system for the biomedical domain

An Elusive Method to Identify Isomorphism and Inversions of Kinematic Chains and Mechanisms

A Novel Cost-Based Model for Data Repairing

Efficient string similarity join in multi-core and distributed systems

B-BabelNet: Business-Specific Lexical Database for Improving Semantic Analysis of Business Process Models

Efficient String Similarity Join in Multi-core and Distributed Systems

A Novel SSPS Framework for String Similarity Join

A Comparative Study for String Metrics and the Feasibility of Joining them as Combined Text Similarity Measures

Efficient String Edit Similarity Join Algorithm

Automatic Schema-Independent Linked Data Instance Matching System

