Abstract

This paper presents a method for vector representation and dimensionality reduction of documents using pretrained language models and Uniform Manifold Approximation and Projection (UMAP). The method aims to visualize Vietnam’s scientific research projects so that users can search for and explore projects similar to a new proposal or research topic. First, documents are vectorized using a pretrained language model. Then, the resulting document vectors are projected onto a two-dimensional space using UMAP. A query is passed through the same two steps as a document. In the two-dimensional space, each document is represented as a circle, and the nearer two circles are, the more similar the corresponding documents are. We take the abstract or title of a project as its representative and refer to it as a document. We conduct experiments to compare the representation power of multilingual BERT-base and PhoBERT by training classifiers using softmax, support vector machines, and multilayer perceptrons, and by visualizing the representations using PCA, t-SNE, and UMAP. The experimental results show that PhoBERT has greater representation power than multilingual BERT-base and that UMAP is superior to PCA and t-SNE. We also present a visualization tool that allows human intervention in similarity search.
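
The final lookup step, where nearness in the two-dimensional space stands for document similarity, can be illustrated with a minimal Python sketch. It assumes the document vectors and the query have already been embedded and projected to 2-D. The function name `nearest_projects`, the project ids, and the coordinates are hypothetical placeholders, not values or code from the paper, and plain Euclidean distance is an assumption about how nearness is measured.

```python
import math


def nearest_projects(query_xy, project_xy, k=3):
    """Return the ids of the k projects whose 2-D points lie closest to the query.

    query_xy   -- (x, y) of the projected query
    project_xy -- dict mapping project id -> (x, y) of the projected document
    """
    # Sort project ids by Euclidean distance to the query point (assumed metric).
    ranked = sorted(project_xy, key=lambda pid: math.dist(query_xy, project_xy[pid]))
    return ranked[:k]


# Hypothetical 2-D coordinates standing in for the projection output.
projects = {
    "P1": (0.0, 0.0),
    "P2": (1.0, 1.0),
    "P3": (5.0, 5.0),
    "P4": (0.5, 0.2),
}
query = (0.4, 0.1)

print(nearest_projects(query, projects, k=2))
```

In a full pipeline, the coordinates would come from the language-model and UMAP steps described above; this sketch covers only the ranking of already projected points.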
