Abstract

Novel contexts, comprising a set of terms referring to one or more concepts, often arise in complex querying scenarios such as evidence-based medicine (EBM) involving biomedical literature. These may not explicitly refer to entities or canonical concept forms occurring in a fact-based knowledge source, e.g., the UMLS ontology. Moreover, hidden associations between concepts that are meaningful in the current context may not exist within a single document, but only across documents in the collection. Predicting semantic concept tags for documents can therefore serve to associate documents related in unseen contexts, or to categorize them, in information filtering or retrieval scenarios. Thus, inspired by the success of sequence-to-sequence neural models, we develop a novel sequence-to-set framework with attention for learning document representations in a unique unsupervised setting: it uses no human-annotated document labels or external knowledge resources, relying only on corpus-derived term statistics to drive the training, and can effect term transfer within a corpus for semantically tagging a large collection of documents. To the best of our knowledge, our sequence-to-set approach to predicting semantic tags achieves the state of the art both on an unsupervised query expansion (QE) task for the TREC CDS 2016 challenge dataset, when evaluated on an Okapi BM25-based document retrieval system, and over the MLTM system baseline (Soleimani and Miller, 2016) on supervised and semi-supervised multi-label prediction tasks on the del.icio.us and Ohsumed datasets. We make our code and data publicly available.
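To make the retrieval evaluation concrete, the following is a minimal sketch of query expansion over a BM25 retriever. It assumes the `rank_bm25` package, a toy three-document corpus, and a hypothetical `predicted_tags` list standing in for the semantic tags the model would predict from pseudo-relevance feedback; it is an illustration, not the paper's pipeline.

```python
# Minimal sketch of BM25-based retrieval with query expansion (QE).
# Assumes: pip install rank-bm25; `predicted_tags` is a hypothetical
# stand-in for the tags the seq2set model would predict.
from rank_bm25 import BM25Okapi

corpus = [
    "statin therapy reduces cardiovascular risk in diabetic patients",
    "gene expression profiling of breast cancer subtypes",
    "warfarin dosing guided by patient genetic information",
]
tokenized_corpus = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "personalized anticoagulant treatment".split()
baseline_scores = bm25.get_scores(query)

# Expand the query with model-predicted tags (hypothetical values here),
# then re-score: expansion terms shared with relevant docs lift their rank.
predicted_tags = ["warfarin", "genetic", "dosing"]
expanded_scores = bm25.get_scores(query + predicted_tags)

for doc, s0, s1 in zip(corpus, baseline_scores, expanded_scores):
    print(f"{s0:6.3f} -> {s1:6.3f}  {doc[:50]}")
```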

Highlights

  • Recent times have seen an upsurge in efforts towards personalized medicine, where clinicians tailor their medical decisions to the individual patient based on the patient's genetic information, other molecular analysis, and the patient's preference

  • We develop a novel sequence-to-set, end-to-end, encoder-decoder-based neural framework for multi-label prediction, training document representations with no external supervision labels, for pseudo-relevance feedback-based unsupervised semantic tagging of a large collection of documents (a simplified sketch follows this list)

  • We find that in this unsupervised setting of pseudo-relevance feedback (PRF)-based semantic tagging for query expansion, a multi-term prediction training objective that jointly optimizes both prediction of the TF-IDF-based document pseudo-labels and the log likelihood of the labels given the document encoding surpasses previous methods such as Phrase2VecGLM (Das et al., 2018), which used neural generalized language models for the same task
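As a simplified picture of the framework in these highlights, the sketch below encodes a token sequence with a self-attention encoder and predicts a set of tags through one sigmoid per candidate term. The single binary cross-entropy loss against TF-IDF pseudo-labels (equivalently, the negative log likelihood of the label set given the document encoding) stands in for the paper's joint objective; the layer sizes, single-layer encoder, and mean pooling are illustrative assumptions, not the published configuration.

```python
# Simplified sequence-to-set tagger: self-attention encoder -> pooled
# document vector -> multi-label (set) prediction over a tag vocabulary.
import torch
import torch.nn as nn

class Seq2SetTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, d_model=128, nhead=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(d_model, num_tags)  # one logit per candidate tag

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))   # (batch, seq, d_model)
        doc_vec = h.mean(dim=1)                   # mean pooling, for brevity
        return self.head(doc_vec)                 # (batch, num_tags) logits

model = Seq2SetTagger(vocab_size=30_000, num_tags=5_000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()  # sigmoid per tag == independent set membership

# One toy training step: `tokens` is a batch of token-id sequences and
# `pseudo_labels` a multi-hot matrix of TF-IDF-selected terms per document.
tokens = torch.randint(1, 30_000, (8, 64))
pseudo_labels = (torch.rand(8, 5_000) < 0.002).float()
loss = bce(model(tokens), pseudo_labels)  # neg. log likelihood of the tag set
loss.backward()
opt.step()
```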



Introduction

Recent times have seen an upsurge in efforts towards personalized medicine, where clinicians tailor their medical decisions to the individual patient based on the patient's genetic information, other molecular analysis, and the patient's preference (our code and data are available at https://github.com/mcoqzeug/seq2set-semantic-tagging). Sequence-to-sequence (seq2seq) neural models, often employing attention mechanisms, have been largely successful in delivering the state of the art for tasks such as machine translation (Bahdanau et al., 2014; Vaswani et al., 2017), handwriting synthesis (Graves, 2013), image captioning (Xu et al., 2015), speech recognition (Chorowski et al., 2015), and document summarization (Cheng and Lapata, 2016). Inspired by these successes, we aimed to harness the power of sequential encoder-decoder architectures with attention to train end-to-end differentiable models that learn the best possible representation of input documents in a collection while being predictive of a set of key terms that best describe the document.
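The "set of key terms that best describe the document" can be derived purely from corpus statistics, consistent with the unsupervised setting described in the abstract. Below is a minimal, illustrative sketch of TF-IDF-based pseudo-labeling with scikit-learn; the toy corpus and the top-k cutoff are assumptions, not the paper's exact procedure.

```python
# Sketch: derive per-document pseudo-label term sets from TF-IDF alone,
# with no human annotation. The top-k cutoff is an illustrative choice.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "statin therapy reduces cardiovascular risk in diabetic patients",
    "gene expression profiling of breast cancer subtypes",
    "warfarin dosing guided by patient genetic information",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)          # (n_docs, n_terms), sparse
terms = np.array(vectorizer.get_feature_names_out())

k = 3
for i, doc in enumerate(docs):
    row = tfidf[i].toarray().ravel()
    pseudo_labels = terms[np.argsort(row)[::-1][:k]]
    print(f"doc {i}: {set(pseudo_labels)}")
```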
