Pre-trained Language Model(PLM) refers to a natural language processing(NLP) model that has been pre-trained using large amounts of text data. The PLM has the limitation of not being able to understand domain-specific terminology due to a lack of training data for terminology. Therefore, the need for a domain-specific language model modified through BERT- or GPT-based pre-trained learning has recently been emphasized. In this study, we analyze BERT's pre-training method and BERT-based transformation techniques (ALBERT, RoBERTa, ELECTRA) and propose a PLM that can be used in biomedical, financial, and legal domains. The biomedical-specific pre-trained learning model is designed to learn domain-specific language characteristics such as technical terminology, medical sentence structure, and medical entity name recognition in the biomedical field. It is mainly adjusted to be applied to biomedical tasks through transfer learning based on BERT's pre-training method and architecture. For this purpose, it is pre-trained with pre-trained biomedical text data, and this pre-training transfers domain-specific knowledge to the model through learning representations for biomedical-related texts. The finance-specific pre-trained learning model is a model that can understand and process financial terminology, financial market trends, and sentence structures and vocabulary related to financial products and services. It can be used to generate news articles about financial market trends and to extract key information by concisely summarizing long texts such as financial reports and corporate press releases. Additionally, finance-specific pre-trained models help financial analysts generate investment recommendations based on a company's financial condition, performance, and prospects. The legal-specific pre-trained model is a language model suitable for legal documents and is used for legal document classification, legal document summarization, and legal document similarity evaluation. The legal-specific pre-learning model was created by pre-training the BERT model on special texts in the legal field, and through this, it learns characteristics specialized for legal documents. The performance of the legal-specific pre-training model can be improved to solve legal-related tasks through scratch pre-training and additional pre-training using legal corpora.
Read full abstract