Abstract

In recent years, pre-trained transformer encoders have achieved notable milestones in natural language understanding (NLU). However, they still suffer from representations with insufficient semantics, limits on input sequence length, and the need for a large number of parameters. These issues cause difficulties not only for practical deployment under limited computational resources but also for downstream applications. To address them, this paper presents ELICE, a method for embedding language through an informative contrastive encoder. Through unordered composition and dropout-based contrastive learning, ELICE enhances long-document embedding at a modest model size and produces generic textual representations by condensing multiple unstructured features, such as titles, metadata, and segments of content. Furthermore, contrastive learning with the NT-Xent objective is shown experimentally to make the document-level embedding space more isotropic. ELICE was evaluated on four tasks of the CLUE benchmark (Mandarin NLU tasks): the basic version, ELICE-base, has 12 million parameters and achieves 99% of the average performance of BERT-base (108 million parameters). On IMDb sentiment classification (an English NLU task), ELICE-base (13 million parameters) achieves 95% of the accuracy of XLNet (110 million parameters). Additionally, we deployed ELICE in a practical system for real-time ad-hoc document retrieval concerning legal compliance in the financial industry, where the fine-tuned ELICE outperformed the pre-trained baseline and whitening-based representations. These results demonstrate the effectiveness, robustness, and flexibility of ELICE for language understanding.
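The abstract does not spell out the dropout-based contrastive objective; the sketch below is one common way such training is implemented (as in SimCSE-style setups), shown only for illustration. The function name, temperature value, and two-pass dropout scheme are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(view_a, view_b, temperature=0.05):
    """Simplified NT-Xent (normalized temperature-scaled cross-entropy) loss.

    view_a / view_b: (batch, dim) embeddings of the same documents encoded
    twice with dropout active, so the two passes give slightly different views.
    """
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    # Cosine similarities between every view-a and every view-b embedding.
    logits = a @ b.t() / temperature            # (batch, batch)
    # The matching pair (i, i) is the positive; other columns act as negatives.
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

# Hypothetical usage: encode the same batch twice with dropout enabled.
# encoder.train()
# loss = nt_xent_loss(encoder(batch), encoder(batch))
```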
