Slovak morphological tokenizer using the Byte-Pair Encoding algorithm.

Dávid Držík, Frantisek Forgac

doi:10.7717/peerj-cs.2465

Journal: PeerJ Computer Science
Publication Date: Nov 19, 2024
License type: CC BY 4.0

Abstract

This study introduces a new approach to text tokenization, the SlovaK Morphological Tokenizer (SKMT), which integrates the morphology of the Slovak language into the training process of the Byte-Pair Encoding (BPE) algorithm. Unlike conventional tokenizers, SKMT focuses on keeping each word root intact within a single token, which is crucial for maintaining lexical meaning. The methodology involves segmenting and extracting word roots from morphological dictionaries and databases, followed by corpus preprocessing and training SKMT alongside a traditional BPE tokenizer. Comparative evaluation against existing tokenizers shows that SKMT preserves roots more reliably, achieving 99.7% root integrity compared to SlovakBERT (90.5%) and a pureBPE tokenizer (93.1%). Further validation involved fine-tuning models on a sentiment classification task, where models trained with SKMT achieved an F1-score improvement of 3.5% over those trained with conventional BPE tokenization, followed by evaluation on the Semantic Textual Similarity (STS) task. These findings suggest that training language models with the SKMT tokenizer significantly enhances model performance and quality.
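To make the root integrity figures more concrete, the following is a minimal sketch of how such a score could be computed. It assumes that a root counts as intact when it appears whole inside one token, which is an interpretation of the abstract and not necessarily the paper's exact definition. The function names and the sample segmentations are hypothetical and are not taken from the paper.

```python
def root_is_intact(tokens, root):
    """Return True if `root` appears whole inside a single token.

    Simplification (assumption): a plain substring check, which ignores
    where in the word the root actually sits.
    """
    return any(root in token for token in tokens)


def root_integrity(samples):
    """Fraction of (tokens, root) pairs whose root is kept in one token."""
    if not samples:
        return 0.0
    intact = sum(root_is_intact(tokens, root) for tokens, root in samples)
    return intact / len(samples)


# Hypothetical segmentations, made up for illustration only.
samples = [
    (["uči", "teľ"], "uč"),           # root inside "uči": intact
    (["u", "čiteľ"], "uč"),           # root split across two tokens: broken
    (["pre", "pis", "ovať"], "pis"),  # root is its own token: intact
    (["prep", "isovať"], "pis"),      # root split across two tokens: broken
]

print(root_integrity(samples))  # 0.5
```

A stricter variant would compare character offsets of the root against token boundaries, so that a chance occurrence of the root string elsewhere in the word is not counted.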

Full Text