Fairness Definitions in Language Models Explained

Abstract

Language Models (LMs) have demonstrated exceptional performance across various Natural Language Processing (NLP) tasks. Despite these advancements, LMs can inherit and amplify societal biases related to sensitive attributes such as gender and race, limiting their adoption in real-world applications. Therefore, fairness has been extensively explored in LMs, leading to the proposal of various fairness notions. However, the lack of clear agreement on which fairness definition to apply in specific contexts and the complexity of understanding the distinctions between these definitions can create confusion and impede further progress. To this end, this paper proposes a systematic survey that clarifies the definitions of fairness as they apply to LMs. Specifically, we begin with a brief introduction to LMs and fairness in LMs, followed by a comprehensive, up-to-date overview of existing fairness notions in LMs and the introduction of a novel taxonomy that categorizes these concepts based on their transformer architecture: encoder-only, decoder-only, and encoder-decoder LMs. We further illustrate each definition through experiments, showcasing their practical implications and outcomes. Finally, we discuss current research challenges and open questions, aiming to foster innovative ideas and advance the field. The repository is publicly available online at https://github.com/vanbanTruong/Fairness-in-Large-Language-Models/tree/main/definitions. This article is categorized under: Commercial, Legal, and Ethical Issues > Fairness in Data Mining; Commercial, Legal, and Ethical Issues > Social Considerations; Technologies > Artificial Intelligence.

Similar Papers
  • Research Article
  • Cited by 7
  • 10.11834/jig.230110
A brief analysis of ChatGPT: historical evolution, current applications, and future prospects
  • Jan 1, 2023
  • Journal of Image and Graphics
  • Liu Yuliang + 3 more

In recent years, artificial intelligence (AI) has made successive breakthroughs, particularly in reinforcement learning, large language models, and AI-generated content (AIGC), and is steadily becoming a driver of innovation across industries. ChatGPT, released by OpenAI on November 30, 2022, attracted broad public attention for its striking natural language understanding and generation abilities and has been widely applied across sectors. Within two months, its monthly active users reached 100 million, making it the fastest-growing consumer application in history. Given this impact, a comprehensive analysis of ChatGPT is warranted. This paper examines ChatGPT from three angles: historical evolution, current applications, and future prospects, exploring its social impact, underlying technology and challenges, and possible future development, and briefly introduces the improvements of GPT-4 over ChatGPT in terms of model capability. Technically, ChatGPT is a language model built on the Transformer architecture and Generative Pre-Training (GPT): trained on a large text corpus, it uses a multi-layer Transformer to predict the probability distribution of the next token and thereby generate natural language text.
OpenAI's language models have improved markedly, from GPT-1 (117 million parameters) in 2018 to GPT-3 (175 billion parameters) in 2020, with language processing and generation capabilities advancing through consistent scaling of model size, generative modeling, and self-supervised learning. In January 2022, InstructGPT applied reinforcement learning based on human feedback to substantially reduce infeasible, untrue, and biased outputs. ChatGPT, introduced in late 2022 as a sister model of InstructGPT, adds conversational abilities on top of InstructGPT, and a test version was opened to the public. Its core technologies include reinforcement learning from human feedback (RLHF), supervised fine-tuning (SFT), instruction fine-tuning (IFT), and chain-of-thought (CoT) prompting. ChatGPT reached roughly 100 million monthly active users within two months of launch; by comparison, TikTok took nine months and Instagram two and a half years to reach the same milestone. According to SimilarWeb, more than 13 million unique visitors used ChatGPT on an average day in January 2023, more than double the figure for December 2022. The leading US new-media company BuzzFeed seized the ChatGPT opportunity and saw its stock price triple in two days. ChatGPT can play multiple roles in domains such as clinical medicine, translation, administration, and programming, and its range of applications is still expanding. However, while ChatGPT has the potential for widespread application in various industries, it cannot be applied universally to all of them.
For example, certain industrial production processes rely on digitalization and do not involve the handling of human language, so natural language processing techniques may not be required there. Other factors, such as legal restrictions and data-privacy concerns, may also constrain the application of natural language processing technologies in certain industries; in sectors that handle sensitive information, such as healthcare, these technologies may need to comply with strict legal regulations to ensure data privacy and security. Beyond industry-specific reasons, ChatGPT has not yet achieved perfection on natural language processing tasks. In summary, as a phenomenal technological product, ChatGPT's potential benefits for textual and multimodal AIGC applications are real to a certain extent, and it may affect the survival of corporations, competition among countries, and the structure of society as a whole. From a technical standpoint, ChatGPT is a milestone for its field, and it has the potential to become one of the greatest achievements in computing. Yet the current wave of positive evaluations does not change the fact that ChatGPT is a question-and-answer (Q&A) solution built on prior knowledge and models: it does not yet possess true cognition, intention, or creativity, and it has not reached the level of strong AI. At this stage, researchers should approach AI technology with both confidence and humility and continue to advance related research and applications.
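As the review notes, GPT-style models generate text by repeatedly predicting a probability distribution over the next token. A minimal sketch of that final step, turning raw model scores (logits) into a distribution and greedily picking the next token, is given below; the vocabulary and logit values are illustrative assumptions, not taken from any real model:

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw model scores into a probability distribution over tokens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_next_token(vocab, logits):
    """Pick the most probable next token, as in greedy decoding."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs[best]

# Illustrative scores a model might assign after the prefix "the cat".
vocab = ["sat", "ran", "banana"]
logits = [2.0, 1.0, -3.0]
token, p = greedy_next_token(vocab, logits)
print(token, round(p, 3))
```

Sampling from the `softmax` distribution (rather than taking the argmax) and adjusting `temperature` is what makes generation varied rather than deterministic.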

  • Research Article
  • Cited by 19
  • 10.3390/app132312901
Contemporary Approaches in Evolving Language Models
  • Dec 1, 2023
  • Applied Sciences
  • Dina Oralbekova + 4 more

This article provides a comprehensive survey of contemporary language modeling approaches within the realm of natural language processing (NLP) tasks. This paper conducts an analytical exploration of diverse methodologies employed in the creation of language models. This exploration encompasses the architecture, training processes, and optimization strategies inherent in these models. The detailed discussion covers various models ranging from traditional n-gram and hidden Markov models to state-of-the-art neural network approaches such as BERT, GPT, LLAMA, and Bard. This article delves into different modifications and enhancements applied to both standard and neural network architectures for constructing language models. Special attention is given to addressing challenges specific to agglutinative languages within the context of developing language models for various NLP tasks, particularly for Arabic and Turkish. The research highlights that contemporary transformer-based methods demonstrate results comparable to those achieved by traditional methods employing Hidden Markov Models. These transformer-based approaches boast simpler configurations and exhibit faster performance during both training and analysis. An integral component of the article is the examination of popular and actively evolving libraries and tools essential for constructing language models. Notable tools such as NLTK, TensorFlow, PyTorch, and Gensim are reviewed, with a comparative analysis considering their simplicity and accessibility for implementing diverse language models. The aim is to provide readers with insights into the landscape of contemporary language modeling methodologies and the tools available for their implementation.

  • Book Chapter
  • 10.1007/978-3-030-31756-0_11
Progress in Neural Network Based Statistical Language Modeling
  • Oct 30, 2019
  • Anup Shrikant Kunte + 1 more

Statistical Language Modeling (LM) is one of the central steps in many Natural Language Processing (NLP) tasks, including Automatic Speech Recognition (ASR), Statistical Machine Translation (SMT), sentence completion, and automatic text generation, to name a few. A good-quality language model has been one of the key success factors for many commercial NLP applications. Over the past three decades, diverse research communities, including psychology, neuroscience, data compression, machine translation, speech recognition, and linguistics, have advanced research in language modeling. We first review the mathematical background of the LM problem, then survey neural-network-based LM techniques in the order they were developed, including recent developments in Recurrent Neural Network (RNN) based language models. Early LM research in ASR gave rise to a commercially successful class of LMs known as n-gram LMs. These models were purely statistical and failed to exploit the linguistic information present in the text itself. With advances in computing power and the availability of vast, rich sources of textual data, neural-network-based LMs entered the arena. These techniques proved significant because they map word tokens into a continuous space rather than treating them as discrete symbols. As NNLM performance proved comparable to state-of-the-art n-gram LMs, researchers also successfully applied deep neural networks to LM. They soon realised that the inherently sequential nature of text makes the LM problem a good candidate for Recurrent Neural Network (RNN) architectures; today the RNN is the neural architecture of choice for LM among most practitioners. This chapter sheds light on variants of neural-network-based LMs.
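The chapter's contrast between count-based n-gram LMs and continuous-space neural LMs can be made concrete with a minimal bigram model; the toy corpus and the add-alpha smoothing constant below are illustrative assumptions:

```python
from collections import Counter, defaultdict

def train_bigram_lm(corpus, alpha=1.0):
    """Count unigrams and bigrams for an add-alpha-smoothed bigram LM."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[prev][cur] += 1
    V = len(vocab)

    def prob(cur, prev):
        # P(cur | prev) with add-alpha smoothing over the vocabulary
        return (bigrams[prev][cur] + alpha) / (unigrams[prev] + alpha * V)

    return prob

corpus = ["the cat sat", "the cat ran", "the dog sat"]
p = train_bigram_lm(corpus)
# "cat" follows "the" in 2 of 3 sentences, so it is the likelier continuation
assert p("cat", "the") > p("dog", "the")
```

A neural LM replaces these discrete count tables with learned dense vectors and a network that can generalize to contexts never seen in training, which is exactly the shift the chapter describes.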

  • Research Article
  • Cited by 13
  • 10.3390/info14030195
Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
  • Mar 20, 2023
  • Information
  • Tilahun Yeshambel + 2 more

Over the past few years, word embeddings and bidirectional encoder representations from transformers (BERT) models have brought better solutions to learning text representations for natural language processing (NLP) and other tasks. Many NLP applications rely on pre-trained text representations, leading to the development of a number of neural network language models for various languages. However, this is not the case for Amharic, which is known to be a morphologically complex and under-resourced language; usable pre-trained models for automatic Amharic text processing are not available. This paper presents an investigation into the usefulness of learned text representations for information retrieval and NLP tasks using word embeddings and BERT language models. We explored the most commonly used word-embedding methods, including word2vec, GloVe, and fastText, as well as the BERT model. We investigated the performance of query expansion using word embeddings, and we analyzed the use of a pre-trained Amharic BERT model for masked language modeling, next-sentence prediction, and text classification tasks. Amharic ad hoc information retrieval test collections that contain word-based, stem-based, and root-based text representations were used for evaluation. We conducted a detailed empirical analysis of the usability of word embeddings and BERT models on word-based, stem-based, and root-based corpora. Experimental results show that word-based query expansion and language modeling perform better than stem-based and root-based text representations, and that fastText outperforms the other word embeddings on the word-based corpus.
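The embedding-based query expansion the paper evaluates can be sketched as a nearest-neighbour lookup in embedding space; the toy 3-dimensional vectors below stand in for trained word2vec/GloVe/fastText embeddings and are purely illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def expand_query(query_terms, embeddings, k=2):
    """Append the k nearest-neighbour terms of each query term."""
    expansion = []
    for term in query_terms:
        if term not in embeddings:
            continue
        neighbours = sorted(
            (w for w in embeddings if w != term and w not in query_terms),
            key=lambda w: cosine(embeddings[term], embeddings[w]),
            reverse=True,
        )
        expansion.extend(neighbours[:k])
    return query_terms + expansion

# Toy 3-d vectors standing in for trained word embeddings.
emb = {
    "river":  [0.9, 0.1, 0.0],
    "stream": [0.8, 0.2, 0.0],
    "bank":   [0.7, 0.3, 0.1],
    "loan":   [0.1, 0.9, 0.2],
}
print(expand_query(["river"], emb, k=1))  # "stream" is river's nearest neighbour
```

In a real retrieval system the expanded term list is handed to the ranking function, so documents mentioning near-synonyms of the query terms can still match.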

  • Research Article
  • Cited by 46
  • 10.1109/access.2019.2952360
Cross-Domain Sentiment Classification With Bidirectional Contextualized Transformer Language Models
  • Jan 1, 2019
  • IEEE Access
  • Batsergelen Myagmar + 2 more

Cross-domain sentiment classification is an important Natural Language Processing (NLP) task that aims at leveraging knowledge obtained from a source domain to train a high-performance learner for sentiment classification on a target domain. Existing transfer learning methods applied to cross-domain sentiment classification mostly focus on inducing a low-dimensional feature representation shared across domains based on pivots and non-pivots, which is still a low-level representation of sequence data. Recently, there has been great progress in the NLP literature on high-level language models based on the Transformer architecture, which are pre-trained on large text corpora and fine-tuned for a specific task with an additional layer on top. Among such language models, the bidirectional contextualized Transformer language models BERT and XLNet have greatly impacted the NLP research field. In this paper, we fine-tune BERT and XLNet for cross-domain sentiment classification. We then explore their transferability through in-depth analysis of the two models' performance and improve on the previous state of the art by a significant margin. Our results show that such bidirectional contextualized language models outperform the previous state-of-the-art methods for cross-domain sentiment classification while using up to 120 times less data.

  • Conference Article
  • Cited by 14
  • 10.5220/0011749400003393
German BERT Model for Legal Named Entity Recognition
  • Jan 1, 2023
  • Harshil Darji + 2 more

The use of BERT, one of the most popular language models, has led to improvements in many Natural Language Processing (NLP) tasks. One such task is Named Entity Recognition (NER), i.e., the automatic identification of named entities such as locations, persons, and organizations in a given text. NER is also an important base step for many NLP tasks such as information extraction and argumentation mining. Although much research has been done on NER using BERT and other popular language models, the same is not explored in detail in Legal NLP or Legal Tech, which applies NLP techniques such as sentence similarity or NER specifically to legal data. There are only a handful of NER models built on BERT language models, and none of them are aimed at legal documents in German. In this paper, we fine-tune a popular BERT language model trained on German data (German BERT) on a Legal Entity Recognition (LER) dataset. To make sure our model is not overfitting, we performed stratified 10-fold cross-validation. The results we achieve by fine-tuning German BERT on the LER dataset outperform the BiLSTM-CRF+ model used by the authors of the same LER dataset. Finally, we make the model openly available via HuggingFace.

  • Conference Article
  • 10.1109/icassp43922.2022.9747525
Enhance Rnnlms with Hierarchical Multi-Task Learning for ASR
  • May 23, 2022
  • Minguang Song + 1 more

It is known that neural language models (NLMs) can implicitly learn certain linguistic information from text. While NLMs generally use only word-feature input, the success of factored NLMs has indicated a benefit of using additional linguistic feature inputs for language modeling. On the other hand, multi-task learning (MTL) has shown positive effects on the generalization performance of various natural language processing (NLP) tasks, including language modeling. However, how best to share information among related tasks in MTL remains an open question. In this work, we propose a hierarchical multi-task learning (HMTL) approach to incorporate linguistic knowledge into recurrent neural network language models (RNNLMs), instead of using linguistic features as word factors. Specifically, we consider the auxiliary tasks of chunking, part-of-speech tagging, and named entity recognition, and supervise the learning of these auxiliary tasks in a hierarchical way. The proposed method can help language models learn knowledge of the linguistic hierarchy from the auxiliary tasks and improve the performance of RNNLMs on automatic speech recognition (ASR). We evaluated HMTL on the WSJ and AMI speech recognition tasks; the experimental results demonstrate the effectiveness of the proposed approach.

  • Research Article
  • Cited by 44
  • 10.1111/epi.17570
Are AI language models such as ChatGPT ready to improve the care of individuals with epilepsy?
  • Mar 13, 2023
  • Epilepsia
  • Christian M Boßelmann + 2 more

Epilepsy is a neurological disorder characterized by recurrent seizures, which can significantly impact the quality of life of affected individuals. Fortunately, advances in artificial intelligence (AI) are providing new opportunities to improve the diagnosis and treatment of epilepsy. Briefly, examples of ongoing epilepsy-related AI research include (1) algorithms that can analyze large amounts of electroencephalography (EEG) time-series data to label interictal epileptiform discharges both independently and with human supervision [1, 2]; (2) diagnostic biomedical imaging with automated magnetic resonance imaging (MRI)-based lesion detection, surgical decision-making support, and outcome prediction [3, 4]; and (3) Clinical Decision Support Systems (CDSS) that use patient data to provide physicians with recommendations based on up-to-date evidence and guidelines for overall improved diagnostic and therapeutic accuracy [5, 6]. Language models are often used in chatbots and other conversational systems to generate context-aware, human-like text in response to an input prompt from a user. Such models are trained on large data sets of human conversations using machine learning (ML) techniques to learn the patterns and structure of natural language. Various AI language models have been developed since the 1950s, but significant advances have only been made in recent years due to improved ML models paired with an increased availability of large amounts of data and computational resources. Some of the earliest examples include ELIZA, developed in the 1960s (one of the first programs to simulate a patient-doctor relationship), and SHRDLU from the 1970s (a program able to emulate dialogue around a simplified world with a limited number of objects, the "blocks world") [7, 8]. However, these early language models were inherently limited in their capabilities and could perform only a narrow range of tasks.
In recent years, more complex, large language models have led to significant progress in natural language processing. Several of these AI language models can be used for dialogue, for example: (1) GPT-3 (Generative Pre-trained Transformer 3), a state-of-the-art language model developed by OpenAI that can generate contextual human-like text for a wide range of applications, including dialogue [9]; (2) DialoGPT, a language model developed by Microsoft that is trained on a large data set of social media comment chains and can generate responses in single-turn conversations [10]; (3) Meena, a sensible and specific language model developed by Google that is trained on human-human conversations from public-domain social media and can generate responses that are coherent and contextually appropriate [11]; and (4) XLNet, a language model developed by Google and Carnegie Mellon University that is capable of several language modeling tasks, including question answering, natural language inference, sentiment analysis, and document ranking [12]; among many others. Such algorithms mainly enable the analysis of free-text electronic medical records and other written materials (e.g., test results and treatment plans) that are otherwise inaccessible without preprocessing and standardization. By analyzing large amounts of free-text medical records, language models can learn to identify and summarize relevant patterns. Possible outcomes are information on identified hierarchical patient subgroups based on seizure patterns, documented treatment options, and outcome parameters [13-15]. This structured information could be queried to provide personalized treatment recommendations based on medical history and other relevant factors.
For example, by identifying early candidates for epilepsy surgery, language models can help minimize treatment delays and improve patient outcomes [16, 17]. Another example of how language models can improve health care are Clinical Decision Support Systems (CDSS) trained to understand and offer natural responses to queries from health care providers. CDSS can provide medical or surgical treatment recommendations, suggest relevant clinical guidelines or protocols, and alert health care providers to potential errors or risks. Similar methods may be used to create virtual assistants for individuals with epilepsy to answer questions and provide easy access to information about their condition, treatment options, and other related topics, including driving, causes of premature death (including sudden unexpected death in epilepsy [SUDEP]), and status epilepticus [18, 19]. Overall, AI language models have the future potential to significantly improve the care and management of individuals with epilepsy by providing natural conversational interfaces to both patients and physicians, allowing easy access to structured information. We tested ChatGPT (ChatGPT Dec 15 Version, available at chat.openai.com, last accessed 01/07/2023 at 9:30 p.m.) on some of the use cases outlined above and provide the prompts used and model responses in Figure 1. First, we assumed the role of an individual with epilepsy taking levetiracetam. The model correctly responded that aggression is a possible side effect and recommended follow-up with the prescribing physician (Figure 1A) [20]. We then requested an Acute Seizure Action Plan (ASAP), a structured treatment plan used to guide patients and caregivers in the event of an epileptic seizure.
The model provided a reasonable first draft in line with expert recommendations (Figure 1B) [21]. We found this useful for quickly generating general patient-facing informational content, but note that each ASAP should be subject to human review to screen for misinformation and to personalize the draft with additional information from the individual's medical history and seizure types. We proceeded to present the model with a short, simplified case study of an individual with treatment-resistant left mesial temporal lobe epilepsy. Of interest, the model correctly integrated the medical history and diagnostic findings, noting that hippocampal sclerosis presents an epileptogenic lesion, before proceeding to recommend epilepsy surgery. Although this assessment represents a simplification of phase I presurgical evaluation findings and surgical strategies, the overall recommendation is sound [22]. However, limitations became apparent when we informed the model that the previously discussed patient now had additional evidence of right temporal lobe seizure onset. Although the initial response is still appropriate, the following advice is actively harmful (Figure 1D). The model confidently states that the patient's health care team may consider bilateral temporal lobectomy, or removal of both temporal lobes and the adjacent frontal and parietal lobes (a procedure incorrectly defined as "hemispherotomy" by the model). Finally, even simple queries for structured information may fail when they concern particularly specialized or disputed areas of knowledge. In Figure 1E, we queried whether there is a relationship between variants in SCN9A and autosomal dominant epilepsy. The positive response was incorrect, likely due to misinformation in the academic literature present in the model's training data.
Any relationship between variants in SCN9A and epilepsy has been refuted [23, 24]. Previous research, as outlined above, has focused on language models trained on large amounts of public-domain data of general human conversations, commonly involving text messages from social media sites (Twitter, Reddit, Facebook, etc.) and some additional training data from books or academic literature. Indeed, the use cases shown above do not accurately represent the limits of this tool, as it was likely not trained on a sufficiently extensive, high-quality, domain-specific data set. It is important to note that language models cannot easily deal with disputed areas of knowledge and may not provide correct answers when contradictions are present in the input data. In light of these general considerations and the specific use cases outlined above, we argue that oversight from medical professionals will be needed to distill training information, and that all current AI applications need to be utilized in combination with human expertise. This is made immediately relevant by the fact that the broad ethical and legal implications of generative models are subjects of ongoing debate, with developers denying liability that may then fall onto the clinician user. Another important limitation of language models is an issue termed "hallucination", which describes confidently formulated answers with incorrect or nonsensical content [25]. This misinformation is a result of biased training data or mismatches between token encoding and concept representation, and it is particularly difficult to identify. Finally, users should be aware that language models show bias against individuals based on gender, race, or disability [26]. This issue is particularly sensitive in epilepsy, where stigma is still prevalent [27]. Extraction of structured information from electronic medical records and assistance with simple human-supervised tasks are feasible use-case scenarios.
However, these systems will need to be thoroughly tested and rigorously validated before they can be used in clinical care, in line with existing regulations on Software as a Medical Device and AI/ML-enabled medical devices [28]. Ultimately, AI language models in epilepsy care will depend on developing robust and reliable systems as per the Ethics Guidelines for Trustworthy Artificial Intelligence [29], driven by community-based data sharing and epilepsy-specific AI research. Outside the clinical care of patients, several successful applications of language models (e.g., smart data processing, content generation, and sentiment analysis) provide a promising perspective on an AI-augmented future clinical practice. To achieve similar success stories with AI language models in epilepsy and general clinical practice, we will need to develop protocols for applying decentralized language learning models (i.e., using federated learning) to distributed identifiable patient data from multiple institutions. These coordinated decentralized language models will take advantage of the collective knowledge and insights of multiple sources, including specialty fields like epilepsy, while protecting patient privacy. We confirm that we have read the Journal's position on issues involved in ethical publication and affirm that this report is consistent with those guidelines. Author contributions: Christian M Boßelmann: Conceptualization, Writing – original draft; Costin Leu: Writing – review & editing; Dennis Lal: Writing – review & editing, Supervision. None. The authors report no conflicts of interest.

  • Conference Article
  • Cited by 2
  • 10.21437/interspeech.2019-1332
Reverse Transfer Learning: Can Word Embeddings Trained for Different NLP Tasks Improve Neural Language Models?
  • Sep 15, 2019
  • Lyan Verwimp + 1 more

Natural language processing (NLP) tasks tend to suffer from a paucity of suitably annotated training data, hence the recent success of transfer learning across a wide variety of them. The typical recipe involves: (i) training a deep, possibly bidirectional, neural network with an objective related to language modeling, for which training data is plentiful; and (ii) using the trained network to derive contextual representations that are far richer than standard linear word embeddings such as word2vec, and thus result in important gains. In this work, we wonder whether the opposite perspective is also true: can contextual representations trained for different NLP tasks improve language modeling itself? Since language models (LMs) are predominantly locally optimized, other NLP tasks may help them make better predictions based on the entire semantic fabric of a document. We test the performance of several types of pre-trained embeddings in neural LMs, and we investigate whether it is possible to make the LM more aware of global semantic information through embeddings pre-trained with a domain classification model. Initial experiments suggest that as long as the proper objective criterion is used during training, pre-trained embeddings are likely to be beneficial for neural language modeling.

  • Conference Article
  • Cited by 40
  • 10.18653/v1/2021.emnlp-main.117
Evaluating the Robustness of Neural Language Models to Input Perturbations
  • Jan 1, 2021
  • Milad Moradi + 1 more

High-performance neural language models have obtained state-of-the-art results on a wide range of Natural Language Processing (NLP) tasks. However, results on common benchmark datasets often do not reflect model reliability and robustness when applied to noisy, real-world data. In this study, we design and implement various types of character-level and word-level perturbation methods to simulate realistic scenarios in which input texts may be slightly noisy or differ from the data distribution on which NLP systems were trained. Conducting comprehensive experiments on different NLP tasks, we investigate the ability of high-performance language models such as BERT, XLNet, RoBERTa, and ELMo to handle different types of input perturbations. The results suggest that language models are sensitive to input perturbations and that their performance can decrease even when small changes are introduced. We highlight that models need to be further improved and that current benchmarks do not reflect model robustness well. We argue that evaluations on perturbed inputs should routinely complement widely used benchmarks in order to yield a more realistic understanding of the robustness of NLP systems.
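Perturbation methods of the kind the study describes can be sketched as simple character- and word-level noise functions; the swap/drop rates and the seeded RNG below are illustrative choices, not the paper's exact setup:

```python
import random

def char_swap(text, rate=0.1, seed=0):
    """Swap adjacent alphabetic characters at random positions (typo-style noise)."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

def word_drop(text, rate=0.2, seed=0):
    """Randomly delete words (truncation- or ASR-style noise)."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() >= rate]
    return " ".join(kept) if kept else text

sentence = "language models are sensitive to input perturbations"
print(char_swap(sentence, rate=0.3))
print(word_drop(sentence, rate=0.3))
```

A robustness evaluation then compares a model's accuracy on the clean inputs against its accuracy on the perturbed copies; a large gap signals the sensitivity the paper reports.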

  • Research Article
  • 10.54254/2755-2721/2025.po24899
A Survey on Pre-trained Language Models Based on Deep Learning: Technological Development and Applications
  • Jul 10, 2025
  • Applied and Computational Engineering
  • Yuansheng Lin

With the advent of the big data era and the enhancement of computing capabilities, deep learning technologies have achieved remarkable breakthroughs in the field of natural language processing (NLP). Pre-trained large language models, such as GPT and BERT, have significantly improved the performance of various NLP tasks, including text generation, question-answering systems, sentiment analysis, and machine translation, through pre-training on large-scale unsupervised data. This paper reviews the latest developments of pre-trained large language models based on deep learning, with a particular focus on the pre-training methods of BERT and GPT. Through a literature review and comparative analysis of models, this paper provides a detailed exploration of the core technologies of pre-trained models. The study finds that the Transformer architecture is the core of pre-trained models, significantly enhancing the performance of language models. However, the expansion of model size also brings increased computational costs and issues of interpretability. Future research directions include efficient pre-training methods, model compression and distillation, multimodal integration, as well as ethical and sustainability issues.

  • Research Article
  • Cited by 13
  • 10.7717/peerj-cs.2222
Natural language processing with transformers: a review.
  • Aug 7, 2024
  • PeerJ. Computer science
  • Georgiana Tucudean + 3 more

Natural language processing (NLP) tasks can be addressed with several deep learning architectures, and many different approaches have proven to be efficient. This study aims to briefly summarize the use cases for NLP tasks along with the main architectures. This research presents transformer-based solutions for NLP tasks such as Bidirectional Encoder Representations from Transformers (BERT), and Generative Pre-Training (GPT) architectures. To achieve that, we conducted a step-by-step process in the review strategy: identify the recent studies that include Transformers, apply filters to extract the most consistent studies, identify and define inclusion and exclusion criteria, assess the strategy proposed in each study, and finally discuss the methods and architectures presented in the resulting articles. These steps facilitated the systematic summarization and comparative analysis of NLP applications based on Transformer architectures. The primary focus is the current state of the NLP domain, particularly regarding its applications, language models, and data set types. The results provide insights into the challenges encountered in this research domain.

  • Conference Article
  • Cited by 1
  • 10.1109/icedeg58167.2023.10122084
Keynote - AI for the Public Sector and the Case of Legal NLP
  • Apr 3, 2023
  • Matthias Stürmer

Recent innovations such as ChatGPT have increased public interest in artificial intelligence (AI). The keynote explained why AI is not just a short-term hype but has a long history spanning several eras. A recent revolution has been in the field of Natural Language Processing (NLP). This interdisciplinary field of research is also known as computational linguistics. It is usually implemented through specific NLP tasks, ranging from simple processing steps such as tokenization, stemming, and lemmatization to Part of Speech (PoS) tagging and topic modeling. A second, more complex set of NLP tasks includes Named Entity Recognition (NER), information retrieval, relationship extraction, sentiment analysis, text similarity, and coreference resolution. Finally, the most challenging NLP tasks are considered to be Question Answering (QA), text summarization, text simplification, text generation, text translation, and chatbots. NLP has especially great potential in the public sector. For example, a new multilingual legal language model for more than 20 languages, developed for the Swiss Federal Court, offers opportunities to increase the accessibility of legal documents for citizens while preserving the digital sovereignty of government institutions. These technical results of the National Research Program (NRP) 77 project “Open Justice versus Privacy” are published on Hugging Face, a platform for sharing openly available machine learning models and datasets. Today, it is mostly private companies that build such Large Language Models (LLMs), because doing so requires a large amount of computational resources and highly skilled engineers. For example, to train the new LLaMA model, Meta AI (Facebook) needed more than $30 million worth of graphical processing units (GPUs). In addition, 450 MWh of electricity worth about $90,000 was needed to process the data on these GPUs. To the detriment of both innovation and the environment, Meta AI released the LLaMA model only under a non-commercial license.
This means that startups and other companies cannot use the model for their own services. This calls for a discussion about how “open” today's machine learning models should be and what “open” actually means in the age of AI. The keynote presentation therefore included a proposal of five elements of such machine learning models that need to be openly available and licensed under an official open license in order to speak of an Open AI Model. This term follows the United Nations definition of Digital Public Goods. These five elements are: 1) model architecture (detailed scientific publications), 2) hyperparameters (build configuration), 3) training data (labeled and unlabeled datasets), 4) model weights and intermediate checkpoints (parameters), and 5) source code to build the model (programming scripts etc.). A truly openly available AI model is BLOOM, an LLM from the BigScience initiative. It was built by more than 1000 researchers from over 70 countries and trained on an infrastructure that would have cost EUR 3 million. BLOOM was released on July 12th, 2022 on Hugging Face and is licensed under the Responsible AI License (RAIL), a new type of AI license that incorporates ethical aspects while preserving the openness of the machine learning elements described.

  • Research Article
  • Cited by 1
  • 10.5075/epfl-thesis-7148
Word Embeddings for Natural Language Processing
  • Jan 1, 2016
  • Rémi Lebret

Word embedding is a feature learning technique which aims at mapping words from a vocabulary into vectors of real numbers in a low-dimensional space. By leveraging large corpora of unlabeled text, such continuous space representations can be computed for capturing both syntactic and semantic information about words. Word embeddings, when used as the underlying input representation, have been shown to be a great asset for a large variety of natural language processing (NLP) tasks. Recent techniques to obtain such word embeddings are mostly based on neural network language models (NNLM). In such systems, the word vectors are randomly initialized and then trained to optimally predict the contexts in which the corresponding words tend to appear. Because words occurring in similar contexts have, in general, similar meanings, their resulting word embeddings are semantically close after training. However, such architectures might be challenging and time-consuming to train. In this thesis, we are focusing on building simple models which are fast and efficient on large-scale datasets. As a result, we propose a model based on counts for computing word embeddings. A word co-occurrence probability matrix can easily be obtained by directly counting the context words surrounding the vocabulary words in a large corpus of texts. The computation can then be drastically simplified by performing a Hellinger PCA of this matrix. Besides being simple, fast and intuitive, this method has two other advantages over NNLM. It first provides a framework to infer unseen words or phrases. Secondly, all embedding dimensions can be obtained after a single Hellinger PCA, while a new training is required for each new size with NNLM. We evaluate our word embeddings on classical word tagging tasks and show that we reach performance similar to that of neural-network-based word embeddings. While many techniques exist for computing word embeddings, vector space models for phrases remain a challenge.
Still based on the idea of proposing simple and practical tools for NLP, we introduce a novel model that jointly learns word embeddings and their summation. Sequences of words (i.e. phrases) with different sizes are thus embedded in the same semantic space by just averaging word embeddings. In contrast to previous methods which reported a posteriori some compositionality aspects by simple summation, we simultaneously train words to sum, while keeping the maximum information from the original vectors. These word and phrase embeddings are then used in two different NLP tasks: document classification and sentence generation. Using such word embeddings as inputs, we show that good performance is achieved in sentiment classification of short and long text documents with a convolutional neural network. Finding good compact representations of text documents is crucial in classification systems. Based on the summation of word embeddings, we introduce a method to represent documents in a low-dimensional semantic space. This simple operation, along with a clustering method, provides an efficient framework for adding semantic information to documents, which yields better results than classical approaches for classification. Simple models for sentence generation can also be designed by leveraging such phrase embeddings. We propose a phrase-based model for image captioning which achieves results similar to those obtained with more complex models. Not only word and phrase embeddings but also embeddings for non-textual elements can be helpful for sentence generation. We therefore explore embedding table elements to generate better sentences from structured data. We experiment with this approach on a large-scale dataset of biographies, where biographical infoboxes were available. By parameterizing both words and fields as vectors (embeddings), we significantly outperform a classical model.
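The count-based recipe in the abstract above (co-occurrence counts → probabilities → Hellinger transform → PCA) can be sketched in a few lines of NumPy. This is a toy illustration under assumed data, not the thesis implementation:

```python
import numpy as np

# Toy word-by-context co-occurrence counts (rows: vocabulary words, cols: context words).
counts = np.array([
    [10.0, 2.0, 0.0, 1.0],
    [8.0, 3.0, 1.0, 0.0],
    [0.0, 1.0, 9.0, 7.0],
    [1.0, 0.0, 8.0, 9.0],
])

# Row-normalise counts to co-occurrence probabilities P(context | word).
probs = counts / counts.sum(axis=1, keepdims=True)

# Hellinger transform: Euclidean distance between sqrt-probability rows
# corresponds (up to a constant) to the Hellinger distance between distributions.
hellinger = np.sqrt(probs)

# PCA via SVD of the centred matrix; the top-k components serve as embeddings.
centered = hellinger - hellinger.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
k = 2
embeddings = U[:, :k] * S[:k]   # one k-dimensional embedding per vocabulary word

print(embeddings.shape)  # (4, 2)
```

Because the decomposition is computed once, any smaller embedding size can be read off by truncating to fewer components, which is the advantage over retraining an NNLM per size that the abstract mentions.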

  • Research Article
  • 10.3233/shti200127
Character-Level Neural Language Modelling in the Clinical Domain.
  • Jan 1, 2020
  • Studies in health technology and informatics
  • Kreuzthaler Markus + 2 more

Word embeddings have become the predominant representation scheme on a token-level for various clinical natural language processing (NLP) tasks. More recently, character-level neural language models, exploiting recurrent neural networks, have again received attention, because they achieved similar performance against various NLP benchmarks. We investigated to what extent character-based language models can be applied to the clinical domain and whether they are able to capture reasonable lexical semantics using this maximally fine-grained representation scheme. We trained a long short-term memory network on an excerpt from a table of de-identified 50-character long problem list entries in German, each of which assigned to an ICD-10 code. We modelled the task as a time series of one-hot encoded single character inputs. After the training phase we accessed the top 10 most similar character-induced word embeddings related to a clinical concept via a nearest neighbour search and evaluated the expected interconnected semantics. Results showed that traceable semantics were captured on a syntactic level above single characters, addressing the idiosyncratic nature of clinical language. The results support recent work on general language modelling that raised the question whether token-based representation schemes are still necessary for specific NLP tasks.
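The input format described above, a time series of one-hot encoded single characters, can be sketched as follows; the alphabet and the sample token are illustrative stand-ins for the German problem-list entries, and the LSTM itself is omitted:

```python
import numpy as np

def one_hot_sequence(text, alphabet):
    """Encode a string as a (len(text), len(alphabet)) one-hot matrix,
    the per-time-step input format for a character-level recurrent LM."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    out = np.zeros((len(text), len(alphabet)), dtype=np.float32)
    for t, ch in enumerate(text):
        out[t, index[ch]] = 1.0
    return out

# Alphabet built from the corpus; here a single made-up clinical entry.
alphabet = sorted(set("diabetes mellitus typ 2"))
x = one_hot_sequence("diabetes", alphabet)
print(x.shape)  # (8, number of distinct characters in the alphabet)
```

Each row is fed to the recurrent network as one time step, so the model sees the text at the maximally fine-grained character level rather than as pre-tokenized words.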
