Abstract

Distributed representation models can generate vector representations only for words that belong to a finite vocabulary collected from the training data. If out-of-vocabulary (OOV) words are not handled properly, they can impair the performance of machine learning methods in natural language processing tasks. This study proposes a new methodology grounded in the well-established top-down theory of human reading, which may serve as a strong basis for developing new techniques to address the OOV problem. To this end, we present MLOH, a Multi-Level OOV Handling approach based on three chained strategies: analogy, decoding, and prediction. Techniques available in the literature are generally limited because they resolve only specific types of OOV words, such as those that can be inferred from their morphological structure or context. Compared with the process human readers use to infer unknown words, relying on a single strategy is generally not effective. We evaluated MLOH on tasks that can be strongly affected by OOV words: part-of-speech tagging, named entity recognition, and categorization of short and noisy texts. The results indicate that the proposed approach is promising, since it handled most of the OOV words encountered, is more general than existing techniques, and achieved competitive performance in all experiments.
