Investigation of the BERT model on nucleotide sequences with non-standard pre-training and evaluation of different k-mer embeddings.

Yao-Zhong Zhang, Zeheng Bai, Seiya Imoto

doi:10.1093/bioinformatics/btad617


Open Access

https://doi.org/10.1093/bioinformatics/btad617


Journal: Bioinformatics (Oxford, England)
Publication Date: Oct 10, 2023
Citations: 1
License type: CC BY 4.0

Affiliation: The University of Tokyo

Abstract

In recent years, pre-training with the transformer architecture has gained significant attention. While this approach has led to notable performance improvements across a variety of downstream tasks, the underlying mechanisms by which pre-training models influence these tasks, particularly in the context of biological data, are not yet fully elucidated. In this study, focusing on the pre-training on nucleotide sequences, we decompose a pre-training model of Bidirectional Encoder Representations from Transformers (BERT) into its embedding and encoding modules to analyze what a pre-trained model learns from nucleotide sequences. Through a comparative study of non-standard pre-training at both the data and model levels, we find that a typical BERT model learns to capture overlapping-consistent k-mer embeddings for its token representation within its embedding module. Interestingly, using the k-mer embeddings pre-trained on random data can yield similar performance in downstream tasks, when compared with those using the k-mer embeddings pre-trained on real biological sequences. We further compare the learned k-mer embeddings with other established k-mer representations in downstream tasks of sequence-based functional prediction. Our experimental results demonstrate that the dense representation of k-mers learned from pre-training can be used as a viable alternative to one-hot encoding for representing nucleotide sequences. Furthermore, integrating the pre-trained k-mer embeddings with simpler models can achieve competitive performance in two typical downstream tasks. The source code and associated data can be accessed at https://github.com/yaozhong/bert_investigation.
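To make the two sequence representations contrasted in the abstract concrete, here is a minimal sketch of per-base one-hot encoding next to a dense k-mer embedding lookup. It assumes overlapping k-mers taken with stride 1 and k = 3; the function names, the embedding dimension, and the randomly initialized table are illustrative placeholders, not the authors' code or their pre-trained embeddings (those are in the linked repository).

```python
import random
from itertools import product


def kmer_tokens(seq, k=3):
    """Split a sequence into overlapping k-mers (stride 1, an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(k=3, alphabet="ACGT"):
    """Map every possible k-mer over the alphabet to an integer id."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def one_hot(seq, alphabet="ACGT"):
    """Sparse baseline: one 4-dimensional indicator vector per base."""
    return [[1.0 if base == a else 0.0 for a in alphabet] for base in seq]


def make_table(vocab, dim=8, seed=0):
    """Hypothetical stand-in for a learned embedding table (random values)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in vocab]


def embed(seq, table, vocab, k=3):
    """Dense alternative: look up one embedding vector per k-mer token."""
    return [table[vocab[token]] for token in kmer_tokens(seq, k)]


seq = "ACGTAC"
tokens = kmer_tokens(seq, k=3)
vocab = build_vocab(k=3)
table = make_table(vocab, dim=8)

sparse = one_hot(seq)                    # len(seq) rows of width 4
dense = embed(seq, table, vocab, k=3)    # len(seq) - k + 1 rows of width 8
```

Adjacent tokens share k - 1 bases (for example `ACG` and `CGT` share `CG`), which is the overlap that the "overlapping-consistent" embeddings in the abstract refer to. In practice the random table would be replaced by embeddings extracted from a pre-trained model.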


