Bidirectional Language Modeling: A Systematic Literature Review

Muhammad Shah Jahan, Shahzad Akbar, Sarah Gul, Muhammad Umar Farooq, Anam Amjad, Habib Ullah Khan, Fabrizio Riguzzi

doi:10.1155/2021/6641832

Abstract

Transfer learning involves two major activities, pretraining and fine-tuning, which are carried out to perform downstream tasks. The advent of the transformer architecture and of bidirectional language models, e.g., bidirectional encoder representations from transformers (BERT), made this form of transfer learning practical. BERT overcomes the limitations of unidirectional language models by removing the dependency on the recurrent neural network (RNN), and it uses the attention mechanism to read input from either direction and capture sentence context better. The performance of downstream tasks in transfer learning depends on factors such as dataset size, step size, and the number of selected parameters. Various state-of-the-art research studies have produced efficient results by contributing to the pretraining phase. However, a comprehensive investigation and analysis of these studies is not yet available. Therefore, this article presents a systematic literature review (SLR) investigating thirty-one (31) influential research studies published during 2018–2020. The following contributions are made: (1) thirty-one (31) models inspired by BERT are extracted; (2) every model is compared with RoBERTa (a replicated BERT model), which uses a large dataset and batch size but a small step size. It is concluded that seven (7) of the thirty-one (31) models in this SLR outperform RoBERTa; three of them were trained on a larger dataset, while the other four were trained on a smaller dataset. Among these seven models, six shared both the feedforward network (FFN) and attention across layers. The remaining twenty-four (24) models are also studied in this SLR with different parameter settings. Furthermore, it is concluded that a model pretrained with a large dataset, large numbers of hidden layers and attention heads, and a small step size with parameter sharing produces better results. This SLR will help researchers pick a suitable model based on their requirements.

Highlights

Transfer learning encompasses training a model on a large text corpus and applying the obtained knowledge to downstream tasks [1]
RoBERTa [17] is a replication of bidirectional encoder representations from transformers (BERT) [27] with a large dataset, batch size, sequence size, parameter count, and number of layers, hidden layers, and attention heads, but with a small step size, no parameter sharing, and no sentence representation learning
The main advantage of comparing with RoBERTa is that it is built on BERT with slightly changed parameters and therefore provides a fair comparison for all other models used in this research

Summary

Introduction

Transfer learning encompasses training a model on a large text corpus and applying the obtained knowledge to downstream tasks [1]. Before the emergence of the transformer architecture for transfer learning, unidirectional language models were used extensively, but these models faced many limitations, such as reliance on a unidirectional recurrent neural network (RNN) architecture and a limited context vector size. To overcome these gaps, bidirectional language models such as bidirectional encoder representations from transformers (BERT) were introduced to improve the performance of downstream tasks. In the beginning, pretraining of bidirectional language models was done via supervised learning [6], but human-labeled datasets are limited. To resolve this issue, the use of unsupervised learning on large corpora increased.
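
The difference between unidirectional and bidirectional context can be illustrated with a minimal sketch. It is not the implementation of BERT or of any model covered in this review: it computes single-head scaled dot-product attention weights in plain Python, once with a causal mask (each position sees only itself and earlier positions) and once without a mask (each position sees the whole sequence). The unidirectional case is modeled here with a masked attention layer rather than an RNN, purely for comparison. The function names `softmax` and `attention_weights` and the toy vectors are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys, causal):
    """Return the attention weight matrix for one head (hypothetical helper).

    causal=True  -> position i attends only to positions <= i (unidirectional)
    causal=False -> position i attends to every position (bidirectional)
    """
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        visible = range(i + 1) if causal else range(len(keys))
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in visible
        ]
        row = softmax(scores)
        # Masked (future) positions receive zero weight.
        weights.append(row + [0.0] * (len(keys) - len(row)))
    return weights


# Toy 2-dimensional vectors for a 4-token sequence (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

uni = attention_weights(tokens, tokens, causal=True)
bi = attention_weights(tokens, tokens, causal=False)

for name, matrix in (("unidirectional", uni), ("bidirectional", bi)):
    print(name)
    for row in matrix:
        print("  " + " ".join(f"{w:.2f}" for w in row))
```

In the unidirectional matrix every entry above the diagonal is zero, so the first token cannot use any later context; in the bidirectional matrix every token places nonzero weight on every other token, which is the property the review attributes to BERT-style models.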

Methods

Results

Discussion

Conclusion