Automatically Assembling a Custom-Built Training Corpus for Improving the Learning of In-Domain Word/Document Embeddings

Yolanda Blanco-Fernández, Alberto Gil-Solla, José J. Pazos-Arias, Diego Quisi-Peralta

doi:10.15388/23-infor527

Open Access

Journal: Informatica
Publication Date: Jan 1, 2023
License type: CC BY 4.0

#General Corpora #In-domain Corpora

Abstract

Embedding models turn words/documents into real-valued vectors by exploiting co-occurrence data gathered from unrelated texts. Crafting domain-specific embeddings from general corpora, which contain only limited domain vocabulary, is challenging. Existing solutions retrain models on small domain datasets, overlooking the potential of gathering rich in-domain texts. We exploit Named Entity Recognition and Doc2Vec to create an in-domain corpus autonomously. Our experiments compare models trained on general and in-domain corpora, highlighting that domain-specific training attains the best results.

Full Text