Abstract
Background
Semantic textual similarity (STS) is a fundamental natural language processing (NLP) task that is widely used in many NLP applications such as Question Answering (QA) and Information Retrieval (IR). It is a typical regression problem, and almost all STS systems model sentence pairs using either distributed representation or one-hot representation.

Methods
In this paper, we propose a novel framework based on a gated network to fuse the distributed representation and one-hot representation of sentence pairs. Several current state-of-the-art distributed representation methods, including the Convolutional Neural Network (CNN), Bi-directional Long Short Term Memory network (Bi-LSTM), and Bidirectional Encoder Representations from Transformers (BERT), were used in our framework, and a system based on this framework was developed for the clinical STS shared task organized by BioCreative/OHNLP in 2018.

Results
Compared with systems using only distributed representation or only one-hot representation, our method achieved a much higher Pearson correlation. Among all distributed representations, BERT performed best. The highest Pearson correlation of our system was 0.8541, exceeding the best official result of the 2018 BioCreative/OHNLP clinical STS shared task (0.8328) by 0.0213.

Conclusions
Distributed representation and one-hot representation are complementary to each other and can be fused by a gated network.
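The page does not include code, so the following is only a minimal PyTorch sketch of how a gated network might fuse a distributed representation with a one-hot (feature-based) representation for STS regression. The layer names, dimensions, and the exact gate formulation are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch (assumed formulation): a sigmoid gate decides, per dimension,
# how much of the distributed vs. one-hot representation to keep before a
# regression head predicts the similarity score.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dist_dim: int, onehot_dim: int, hidden_dim: int):
        super().__init__()
        # Project both representations into a shared space.
        self.proj_dist = nn.Linear(dist_dim, hidden_dim)
        self.proj_onehot = nn.Linear(onehot_dim, hidden_dim)
        # Gate computed from the concatenation of both inputs.
        self.gate = nn.Linear(dist_dim + onehot_dim, hidden_dim)
        self.regressor = nn.Linear(hidden_dim, 1)  # STS is a regression task

    def forward(self, dist_rep: torch.Tensor, onehot_rep: torch.Tensor):
        g = torch.sigmoid(self.gate(torch.cat([dist_rep, onehot_rep], dim=-1)))
        fused = g * torch.tanh(self.proj_dist(dist_rep)) \
              + (1 - g) * torch.tanh(self.proj_onehot(onehot_rep))
        return self.regressor(fused).squeeze(-1)  # predicted similarity score
```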
Highlights
Semantic textual similarity (STS) is a fundamental natural language processing (NLP) task that is widely used in many NLP applications such as Question Answering (QA) and Information Retrieval (IR).
Distributed representation: In this study, we investigated three types of distributed representations: Siamese Convolutional Neural Network (CNN) [32], Siamese RNN [21], and Bidirectional Encoder Representations from Transformers (BERT) [33]. Siamese CNN and Siamese RNN are two popular neural networks for representing sentence pairs, while BERT is a recently proposed language representation method.
We adopted Bi-directional Long Short Term Memory (Bi-LSTM) networks as an implementation of RNN, where each word i of s1 (and analogously of s2) is represented as

$h_i^{s_1} = \mathrm{LSTM}(h_{i+1}, s_{1i}), \quad i = 1, \ldots, m \qquad (2)$

(see the encoder sketch below).

[Fig. 2: Overview architecture of our distributed representation and one-hot representation fusion system based on gated network]
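As a concrete illustration of the Siamese Bi-LSTM encoder in the spirit of eq. (2), here is a hypothetical PyTorch sketch: the same Bi-LSTM encodes both sentences of the pair, and the hidden states are pooled into fixed-size distributed representations. The dimensions and mean-pooling choice are assumptions, not the paper's exact configuration.

```python
# Assumed sketch of a Siamese Bi-LSTM sentence-pair encoder.
import torch
import torch.nn as nn

class SiameseBiLSTM(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 300, hidden_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True runs both the forward and backward LSTMs,
        # the backward pass corresponding to eq. (2).
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)

    def encode(self, token_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.encoder(self.embedding(token_ids))  # (batch, len, 2*hidden)
        return h.mean(dim=1)                            # mean-pool over words

    def forward(self, s1_ids: torch.Tensor, s2_ids: torch.Tensor):
        # Siamese: identical weights encode both sentences of the pair.
        return self.encode(s1_ids), self.encode(s2_ids)
```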
Summary
Task definition
Formally, the clinical STS task is to determine the similarity of a pair of given sentences, denoted by sim(s1, s2), where s1 is a sentence of length m and s2 is a sentence of length n. We adopted Bi-directional Long Short Term Memory (Bi-LSTM) networks as an implementation of RNN to represent each word i of s1 and s2, as in eq. (2) above.

One-hot representation
We followed Tian's work [34] to extract the following two types of features: (1) sentence-level features: IDF (inverse document frequency) [35] and sentence length; (2) sentence pair-level features: n-gram overlaps defined in eq. (6), and distances or similarities between the two input sentences calculated by cosine, Manhattan, Euclidean, Chebyshev, polynomial kernel, RBF kernel, Laplacian kernel, and sigmoid kernel, after each sentence is represented by the average vector of all its words' embeddings (https://github.com/mmihaltz/word2vec-GoogleNews-vectors). All systems were evaluated on the clinical STS corpus of the BioCreative/OHNLP challenge in 2018, and Pearson correlation was used to measure system performance.
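To make the sentence pair-level features concrete, here is an illustrative Python sketch using scikit-learn's pairwise metrics for the distances and kernels named above. The tokenization, the n-gram overlap ratio (eq. (6) is not reproduced on this page, so a standard Jaccard-style overlap is assumed), and the embedding lookup are simplified assumptions.

```python
# Assumed sketch of the handcrafted (one-hot) pair-level features.
import numpy as np
from scipy.spatial.distance import chebyshev
from sklearn.metrics.pairwise import (
    cosine_similarity, manhattan_distances, euclidean_distances,
    polynomial_kernel, rbf_kernel, laplacian_kernel, sigmoid_kernel)

def ngram_overlap(s1_tokens, s2_tokens, n=2):
    """Jaccard-style n-gram overlap ratio (a stand-in for eq. (6))."""
    g1 = {tuple(s1_tokens[i:i + n]) for i in range(len(s1_tokens) - n + 1)}
    g2 = {tuple(s2_tokens[i:i + n]) for i in range(len(s2_tokens) - n + 1)}
    return len(g1 & g2) / max(len(g1 | g2), 1)

def pair_features(v1: np.ndarray, v2: np.ndarray) -> np.ndarray:
    """Distances/similarities between two averaged word-embedding vectors."""
    a, b = v1.reshape(1, -1), v2.reshape(1, -1)
    return np.array([
        cosine_similarity(a, b)[0, 0],
        manhattan_distances(a, b)[0, 0],
        euclidean_distances(a, b)[0, 0],
        chebyshev(v1, v2),
        polynomial_kernel(a, b)[0, 0],
        rbf_kernel(a, b)[0, 0],
        laplacian_kernel(a, b)[0, 0],
        sigmoid_kernel(a, b)[0, 0],
    ])
```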