Abstract

With the advent of transformer-based architectures, contextual representation of text enables both the query and the document to be encoded in a low-dimensional dense vector space. These vectors are learned, fixed-size embeddings that support a deeper understanding of text. In this study, we design a pipeline for effectively retrieving documents from a large search space by combining the deep text-understanding capability of the transformer-based BERT model with a phrase embedding-based query expansion model. To learn contextual representations, we fine-tune a deep semantic matching model that encodes the document and the query separately. The encoder is based on the Sentence-BERT (SBERT) architecture, which generates dense vector representations of documents and queries independently of each other. The study also addresses the maximum token-length limitation of transformer-based models by summarizing lengthy documents. In addition, to improve the clarity and completeness of short queries and reduce the semantic gap, a phrase embedding-based query expansion model is employed. The documents and their dense vectors are indexed with the Elasticsearch engine and matched against query vectors to retrieve query-specific documents. Finally, a BERT-based cross-encoder model re-ranks the relevant records for each query; it performs full self-attention over its inputs, yielding richer text interactions that produce the final results. To assess performance, experiments are conducted on two well-known datasets, TREC-CDS-2014 and OHSUMED. A comparative analysis clearly demonstrates that the proposed framework produces competitive retrieval results.
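To make the two-stage retrieve-then-re-rank design concrete, the sketch below illustrates bi-encoder retrieval followed by cross-encoder re-ranking using the open-source sentence-transformers library. It is a minimal illustration under stated assumptions, not the paper's implementation: the model checkpoints, toy corpus, and query are hypothetical, generic public checkpoints stand in for the fine-tuned SBERT and cross-encoder models, the query expansion and summarization stages are omitted, and an in-memory cosine-similarity search substitutes for the Elasticsearch vector match.

from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Generic public checkpoints standing in for the paper's fine-tuned models.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Toy corpus; in the described pipeline, lengthy documents would first be
# summarized to fit within the encoder's maximum token length.
documents = [
    "Beta-blockers reduce mortality in patients after myocardial infarction.",
    "Community-acquired pneumonia often presents with fever and productive cough.",
    "Dense retrieval maps queries and documents into a shared embedding space.",
]

# Stage 1: encode the query and documents separately with the bi-encoder,
# then retrieve candidates by cosine similarity over the dense vectors.
query = "treatment after a heart attack"
doc_embeddings = bi_encoder.encode(documents, convert_to_tensor=True)
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=3)[0]

# Stage 2: re-rank the candidates with the cross-encoder, which applies
# full self-attention jointly over each (query, document) pair.
pairs = [(query, documents[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)
for score, hit in sorted(zip(scores, hits), key=lambda x: x[0], reverse=True):
    print(f"{score:.3f}  {documents[hit['corpus_id']]}")

In the full pipeline, the candidate set would instead come from a dense-vector query against the Elasticsearch index, and the query would first pass through the phrase embedding-based expansion model.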
