Similarity corpus on microbial transcriptional regulation

Oscar Lithgow-Serrano, Alberto Santos-Zavaleta, Sara Martínez-Luna, Víctor H Tierrafría, Julio Collado-Vides, Socorro Gama-Castro, David Velázquez-Ramírez, Cecilia Ishida-Gutiérrez, Citlalli Mejía-Almonte

doi:10.1186/s13326-019-0200-x

Open Access

Abstract

Background: The ability to express the same meaning in different ways is a well-known property of natural language. This amazing property is the source of major difficulties in natural language processing. Given the constant increase in published literature, its curation and information extraction would strongly benefit from efficient automatic processes, for which corpora of sentences evaluated by experts are a valuable resource.

Results: Given our interest in applying such approaches to the curation of the biomedical literature, specifically the literature on gene regulation in microbial organisms, we built a corpus of graded textual similarity that was evaluated by curators and designed specifically for our purposes. Based on the predefined statistical power of future analyses, we defined features of the design, including sampling, selection criteria, balance, and size, among others. A non-fully crossed study design was applied. Each pair of sentences was evaluated by 3 annotators from a total of 7; the scale used in the semantic similarity assessment task within the Semantic Evaluation workshop (SEMEVAL) was adapted to our goals in four successive iterative sessions, with clear improvements in the agreed guidelines and interrater reliability results. Alternatives for such a corpus evaluation have been widely discussed.

Conclusions: To the best of our knowledge, this is the first similarity corpus (a dataset of pairs of sentences for which human experts rate the semantic similarity of each pair) in this domain of knowledge. We have initiated its incorporation in our research towards high-throughput curation strategies based on natural language processing.
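As an illustration of the non-fully crossed design mentioned in the abstract (each pair rated by 3 of the 7 annotators), the following is a minimal Python sketch. The abstract does not say how annotators were actually allocated to pairs; the least-loaded allocation rule, the function name `assign_annotators`, and the annotator labels are assumptions made only for this example.

```python
import random


def assign_annotators(pair_ids, annotators, per_pair=3, seed=0):
    """Give every sentence pair `per_pair` distinct annotators.

    Assumption for illustration: the annotators with the lowest current
    workload are chosen first, with ties broken at random, so that the
    workload stays even. The source does not describe its allocation rule.
    """
    rng = random.Random(seed)
    load = {name: 0 for name in annotators}
    assignment = {}
    for pair_id in pair_ids:
        # Sort by current load; the random component only breaks ties.
        ranked = sorted(annotators, key=lambda name: (load[name], rng.random()))
        chosen = ranked[:per_pair]
        for name in chosen:
            load[name] += 1
        assignment[pair_id] = sorted(chosen)
    return assignment


if __name__ == "__main__":
    # Hypothetical pair identifiers and annotator labels.
    pairs = [f"pair-{i}" for i in range(10)]
    raters = [f"A{i}" for i in range(1, 8)]
    for pair_id, names in assign_annotators(pairs, raters).items():
        print(pair_id, names)
```

Because no annotator rates every pair, agreement statistics for such a design need to tolerate missing ratings per annotator, which is one reason the choice of interrater reliability measure matters here.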

Highlights

The ability to express the same meaning in different ways is a well-known property of natural language
The sampling policy defines where and how the candidate texts are selected, following three main criteria: the orientation, in this case a contrastive corpus that aims to show the language varieties that express the same meaning; the selection criteria, which restrict candidates to written sentences in English taken from scientific articles on the topic of genetic regulation, where the sentence attitude is irrelevant and no specific content is required; and the sampling criteria, which consist of a preselection of sentence pairs through a very basic Semantic Textual Similarity (STS) component followed by a filtering process that keeps the same number of exemplars for each similarity grade, i.e., a balanced candidate set
The referred basic STS process was performed by a tool that we developed to compare the semantic similarity of two sentences using only their word embeddings
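The preselection and balancing steps described in the highlights can be sketched roughly as follows. This is not the authors' tool: averaging word vectors and comparing them by cosine similarity is one common way to score sentences "using only their word embeddings", and it is assumed here. The toy `embeddings` dictionary, the equal-width mapping of scores to grades, and the default of 5 grades are likewise assumptions for illustration.

```python
import math
import random
from collections import defaultdict


def sentence_vector(sentence, embeddings):
    """Average the vectors of the words that have an embedding."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity(s1, s2, embeddings):
    """Score two sentences from their averaged word embeddings."""
    v1 = sentence_vector(s1, embeddings)
    v2 = sentence_vector(s2, embeddings)
    if v1 is None or v2 is None:
        return 0.0  # no known words in at least one sentence
    return cosine(v1, v2)


def balanced_sample(scored_pairs, n_grades=5, seed=0):
    """Bucket (pair, score) items into grades and keep the same number per grade.

    Assumption: scores in [0, 1] are cut into `n_grades` equal-width bins,
    and every non-empty bin is downsampled to the size of the smallest one.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for pair, score in scored_pairs:
        clipped = min(max(score, 0.0), 1.0)
        grade = min(int(clipped * n_grades), n_grades - 1)
        buckets[grade].append(pair)
    if not buckets:
        return {}
    quota = min(len(items) for items in buckets.values())
    return {g: rng.sample(items, quota) for g, items in sorted(buckets.items())}


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings, for demonstration only.
    embeddings = {"gene": [1.0, 0.0], "operon": [0.9, 0.1], "represses": [0.0, 1.0]}
    print(similarity("gene represses", "operon represses", embeddings))
```

In practice the embeddings would come from a model trained on domain text, and grades with no candidates would need more sampling rather than silent omission.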

Summary

Introduction

The ability to express the same meaning in different ways is a well-known property of natural language. Expressing the same approximate meaning with different wording is a phenomenon widely present in the everyday use of natural language. It shows the richness and polymorphic power of natural language, but it also exposes the complexity involved in understanding the conveyed meaning. The difficulty stems from the fact that it is very complicated to envisage all possible language feature variations that express the same idea, and so to have a broad perspective and to identify which features or relations are important. It is for these steps that a paraphrase corpus is a very useful instrument, because it implicitly captures those nuances.

Journal: Journal of Biomedical Semantics (2019) 10:8
Publication Date: May 22, 2019
Citations: 9
License type: open-access
