Chapter 9 - Big Data Driven Natural Language Processing Research and Applications
- Research Article
14
- 10.1162/coli_a_00420
- Dec 7, 2021
- Computational Linguistics
Natural Language Processing and Computational Linguistics
- Research Article
- 10.1162/coli_r_00388
- Oct 29, 2020
- Computational Linguistics
Like any other science, research in natural language processing (NLP) depends on the ability to draw correct conclusions from experiments. A key tool for this is statistical significance testing: We use it to judge whether a result provides meaningful, generalizable findings or should be taken with a pinch of salt. When comparing new methods against others, performance metrics often differ by only small amounts, so researchers turn to significance tests to show that improved models are genuinely better. Unfortunately, this reasoning often fails because we choose inappropriate significance tests or carry them out incorrectly, making their outcomes meaningless. Or, the test we use may fail to indicate a significant result when a more appropriate test would find one. NLP researchers must avoid these pitfalls to ensure that their evaluations are sound and ultimately avoid wasting time and money through incorrect conclusions. This book guides NLP researchers through the whole process of significance testing, making it easy to select the right kind of test by matching canonical NLP tasks to specific significance testing procedures. As well as being a handbook for researchers, the book provides theoretical background on significance testing, includes new methods that solve problems with significance tests in the world of deep learning and multidataset benchmarks, and describes the open research problems of significance testing for NLP. The book focuses on the task of comparing one algorithm with another. At the core of this is the p-value, the probability that a difference at least as extreme as the one we observed could occur by chance. If the p-value falls below a predetermined threshold, the result is declared significant. Leaving aside the fundamental limitation of turning the validity of results into a binary question with an arbitrary threshold, to be a valid statistical significance test, the p-value must be computed in the right way.
The book describes the two crucial properties of an appropriate significance test: The test must be both valid and powerful. Validity refers to the avoidance of type 1 errors, in which the result is incorrectly declared significant. Common mistakes that lead to type 1 errors include deploying tests that make incorrect assumptions, such as independence between data points. The power of a test refers to its ability to detect a significant result and therefore to avoid type 2 errors. Here, knowledge of the data and experiment must be used to choose a test that makes the correct assumptions. There is a trade-off between validity and power, but for the most common NLP tasks (language modeling, sequence labeling, translation, etc.), there are clear choices of tests that provide a good balance. Beginning with a detailed background on significance testing, the book then shows the reader how to carry out tests for specific NLP tasks. There is a mix of styles, with the first four chapters providing reference material that will be extremely useful to both new and experienced researchers. Here, it is easy to find the material related to a given NLP task. The next two chapters discuss more recent research into the application of significance tests to deep neural networks and for testing across multiple datasets. Alongside open research questions, these later chapters provide clear guidelines on how to apply the proposed methods. It is this mix of background material and reference guidelines that I believe makes this book so compelling and nicely self-contained. The introduction in Chapter 1 motivates the need for a comprehensive textbook and outlines challenges that the later chapters address more deeply. The theoretical background material begins in Chapter 2, which introduces core concepts, including hypothesis testing, type 1 and type 2 errors, validity and power, and p-values.
The reader does not need to have any prior knowledge of statistical significance tests to follow this part. However, experienced readers could still benefit from reading this chapter, as concepts such as p-values are widely misunderstood and misused (Amrhein, Greenland, and McShane 2019). The significance tests themselves are introduced in Chapter 3, categorized into parametric and nonparametric tests. The chapter begins with the intuitively simple paired z-test, then builds up to more commonly applied techniques, showing the connections and assumptions that each test makes. Step-by-step algorithms help the reader to implement each test. Although the chapter does cite uses of tests in NLP research, the main purpose is to present the theory behind each test and point out their differences. Chapter 4 provides perhaps the handiest part of the book for reference: a correspondence between common NLP tasks and statistical tests. Each task is discussed in terms of the evaluation metrics used, then a decision tree is introduced to guide the reader toward a choice between a parametric test, bootstrap or randomization test, or sampling-free nonparametric test. Section 4.3 then links each NLP evaluation measure to a specific significance test, presenting a large table that helps readers identify which test they need for a specific task. Particular considerations for each task are also pointed out to provide more detail about the appropriate options. The final part of this chapter describes the issue of p-hacking, in which dataset sizes are increased until a significance threshold is reached without consideration for biases in the data (discussed, for example, in Hofmann [2015]). The chapter proposes a simple solution to ensure robust significance testing with large datasets. Where Chapter 4 presents well-established methods, Chapter 5 introduces the current research question of how best to apply statistical significance testing to deep learning.
Non-convex loss functions, stochastic optimization, random initialization, and a multitude of hyperparameters limit the conclusions we can draw from a single test run of a deep neural network (DNN). This chapter, which is based on the authors’ ACL paper (Dror, Shlomov, and Reichart 2019), explains how the comparison process can be overhauled to provide more meaningful evaluations. Beginning by explaining the difficulties of evaluating DNNs, the chapter then introduces criteria for a comparison framework, then discusses the limitations of current methods. Reimers and Gurevych (2018) have previously tackled this problem, but their approach has limited power and does not provide a confidence score. Other works, such as Clark et al. (2011), compare DNNs using a collection of statistics, such as the mean or standard deviation of performance across runs. This book shows how such an approach violates the assumptions of the significance tests. The authors propose almost stochastic dominance as the basis for a better alternative. The chapter explains how to use the proposed method, evaluates it in an empirical case study, and finally analyzes the errors made by each testing approach. Large NLP models are often tested across a range of datasets, which presents another problem for standard significance testing. Chapter 6 discusses the challenges of assessing two questions: (1) On how many datasets does algorithm A outperform algorithm B? (2) On which datasets does A outperform B? Applying standard significance tests individually to each dataset and counting the number of significant results is likely to overestimate the total number of significant results, as this chapter explains. The authors then present a new framework for replicability analysis, based on partial conjunction testing, and discuss two variants (Bonferroni and Fisher) for when the datasets are independent or dependent.
They introduce a method based on Benjamini and Heller (2008) to count the number of datasets where one method outperforms another, then show how to use the Holm procedure (Holm 1979) to identify which datasets these are. Chapter 6 provides a great deal of detailed background on the proposed replicability analysis framework; the later sections again link the process to specific NLP case studies, and step-by-step summaries help the reader to apply the methodology. Extensive empirical results illustrate the very substantial differences in outcomes between the proposed approach and standard procedures. The final two chapters present open problems and conclude, showing that the topic has many interesting research questions of its own, such as problems when performing cross-validation and the limited statistical power of replicability analysis. Overall, I highly recommend this book to a wide range of NLP researchers, from new students to seasoned experts who wish to ensure that they compare methods effectively. The book is excellent both as an introduction to the topic of significance testing and as a reference to use when evaluating your results. For anyone with further interest in the topic, it also points the way to future work. If one could level any criticism at this book at all, it is that it does not deeply discuss the basic flaws of significance testing or what the alternatives might be. For now, though, significance testing is an integral part of NLP research, and this book provides a great resource for researchers who wish to perform it correctly and painlessly.
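The bootstrap test mentioned in the review's discussion of Chapter 4 is easy to sketch. The following is a minimal illustration of one common paired-bootstrap recipe (resample the test set with replacement and count how often system A's mean advantage over system B vanishes); the function and variable names are illustrative, not taken from the book:

```python
import random

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Approximate p-value for 'system A is no better than system B'.

    scores_a and scores_b are paired per-example metric values on the same
    test set. We resample example indices with replacement and count how
    often A's mean advantage disappears in a resample.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    losses = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if delta <= 0:  # A's advantage vanished in this resample
            losses += 1
    return losses / n_resamples
```

Note that when such a test is run separately on each dataset of a benchmark suite, the per-dataset results still need multiple-comparison correction, which is exactly the problem the replicability analysis of Chapter 6 addresses.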
- Research Article
1
- 10.1017/s135132491200006x
- Mar 14, 2012
- Natural Language Engineering
During the last decade, machine learning and, in particular, statistical approaches have become more and more important for research in Natural Language Processing (NLP) and Computational Linguistics. Nowadays, most practitioners in the field use machine learning, as it can significantly enhance both system design and performance. However, machine learning requires careful parameter tuning and feature engineering for representing language phenomena. The latter becomes more complex when the system input/output data is structured, since the designer has both to (i) engineer features for representing structure and model interdependent layers of information, which is usually a non-trivial task; and (ii) generate a structured output using classifiers, which, in their original form, were developed only for classification or regression. Research in empirical NLP has tackled this problem by constructing output structures as a combination of the predictions of independent local classifiers, possibly applying post-processing heuristics to correct incompatible outputs by enforcing global properties. More recently, advances in statistical learning theory, namely structured output spaces and kernel methods, have brought techniques for directly encoding dependencies between data items in a learning algorithm that performs global optimization. Within this framework, this special issue aims at studying, comparing, and reconciling the typical domain/task-specific NLP approaches to structured data with the most advanced machine learning methods. In particular, the selected papers analyze the use of diverse structured input/output approaches, ranging from re-ranking to joint constraint-based global models, for diverse natural language tasks, i.e., document ranking, syntactic parsing, sequence supertagging, and relation extraction between terms and entities.
Overall, the experience with this special issue shows that, although a definitive unifying theory for encoding and generating structured information in NLP applications is still far from taking shape, some interesting and effective best practices can be defined to guide practitioners in modeling their own natural language applications on complex data.
- Conference Article
1
- 10.1109/csitss.2018.8768761
- Dec 1, 2018
An objective of neural language modelling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult due to the huge computational requirements and the curse of dimensionality: a word sequence the model encounters during testing is likely to differ from all the word sequences seen during training. Recent work on learning word vector representations has been successful in capturing semantic and syntactic relationships between the words of a language. These word embeddings have proven very effective in various Natural Language Processing (NLP) tasks such as machine translation, question answering, and text summarization. Training word embeddings with neural networks has been prevalent among NLP researchers. Two major models, Continuous Bag of Words (CBOW) and Skip-gram, have not only improved accuracy but also reduced training time. However, the vector space representation can still be improved using existing techniques that are rarely used together, such as the subword model, in which a word is represented as a weighted average of its character n-gram representations. Although pre-trained word vectors are a key requirement in many NLP tasks, generating word vectors for Indian languages has drawn comparatively little attention. This paper proposes a distributed representation for Kannada words using an optimal neural network model and combining various known techniques.
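The subword idea described above can be illustrated with a short sketch: a word's vector is built by averaging the vectors of its character n-grams, so morphologically related word forms share statistics. This is a minimal fastText-style illustration under assumed toy dimensions and names, not the paper's actual model:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers, e.g. <cat>."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def word_vector(word, ngram_vectors, dim=4):
    """Represent a word as the average of its known n-gram vectors.

    n-grams absent from the (toy) ngram_vectors table are skipped, so even
    out-of-vocabulary words get a vector from whatever subwords are known.
    """
    grams = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not grams:
        return [0.0] * dim
    vec = [0.0] * dim
    for g in grams:
        for j, x in enumerate(ngram_vectors[g]):
            vec[j] += x
    return [x / len(grams) for x in vec]
```

In a trained model the n-gram vectors would be learned jointly with a CBOW or Skip-gram objective; here they are simply looked up in a dictionary to show the compositional step.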
- Research Article
4
- 10.1007/s10994-005-1399-6
- Sep 1, 2005
- Machine Learning
Machine learning techniques have long been the foundations of speech processing. Bayesian classification, decision trees, unsupervised clustering, the EM algorithm, maximum entropy, etc. are all part of existing speech recognition systems. The success of statistical speech recognition has led to the rise of statistical and empirical methods in natural language processing. Indeed, many of the machine learning techniques used in language processing, from statistical part-of-speech tagging to the noisy channel model for machine translation, have roots in work conducted in the speech field. However, advances in learning theory and algorithmic machine learning approaches in recent years have led to significant changes in the direction and emphasis of the statistical and learning centered research in natural language processing and made a mark on natural language and speech processing. Approaches such as memory based learning, a range of linear classifiers such as Boosting, SVMs and SNoW, and others have been successfully applied to a broad range of natural language problems, and these now inspire new research in speech retrieval and recognition. We have seen an increasingly close collaboration between voice and language processing researchers in some of the shared tasks such as spontaneous speech recognition and understanding, voice data information extraction, and machine translation. The purpose of this special issue was to invite speech and language researchers to communicate with each other, and with the machine learning community, on the latest machine learning advances in their work. The call for papers was met with great enthusiasm from the speech and natural language community. Thirty-six submissions were received; each paper was reviewed by at least three reviewers.
Only ten papers were selected, reflecting not only some of the best work on machine learning in the areas of natural language and spoken language processing but also what we view as a collection of papers that represent current trends in these areas of research both from the perspective of
- Research Article
- 10.52783/jes.1506
- Apr 4, 2024
- Journal of Electrical Systems
This paper describes in detail the Universal Parts of Speech (UPoS) tagged dataset for the Assamese language. A PoS-tagged dataset in a language is crucial for experimenting and creating resources for various Natural Language Processing (NLP) and AI research. With the growing usage of Universal Dependency standards, datasets tagged with Universal PoS tags are becoming essential for contemporary experiments in the NLP community. NLP research in Assamese, an Indo-Aryan language, is relatively new, and the language is considered a low-resource language. The dataset of UPoS-tagged Assamese text is created with the aim of contributing towards enriching this low-resource language for NLP and AI tasks. The dataset comprises 283,506 tokens of Assamese vocabulary across a total of 20,280 sentences, tagged with the 17 standard UPoS tags of core lexical categories. The raw data are taken from an open-source corpus originally tagged with the BIS tagset. The original corpus of 453,457 tokens across 29,504 sentences was reduced, after data filtering, to this clean resource of 283,506 tokens. Lexical category mapping from the BIS to the UPoS tagset was carried out with linguistic expertise, and the mapped patterns were used for a first-level conversion of BIS tags to UPoS tags. Linguistic validation was also performed with linguistic experts, and inter-annotator agreements/disagreements were recorded. A second level of validation resolved the disagreements, producing the final version of the dataset. This Assamese UPoS-tagged dataset is the first of its kind with UPoS annotations and should serve the wider Assamese NLP research community for model training using Machine Learning/Deep Learning techniques.
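A first-level tagset conversion of the kind described can be sketched as a dictionary lookup over (token, tag) pairs. The BIS-to-UPoS entries below are illustrative examples only, not the validated mapping table produced by the authors' linguistic experts:

```python
# Illustrative BIS -> UPoS entries; a real mapping would cover the full
# BIS tagset and be validated by linguists, as the paper describes.
BIS_TO_UPOS = {
    "N_NN": "NOUN", "N_NNP": "PROPN", "V_VM": "VERB",
    "JJ": "ADJ", "RB": "ADV", "PR_PRP": "PRON", "RD_PUNC": "PUNCT",
}

def convert(tagged_sentence, mapping=BIS_TO_UPOS):
    """Map each (token, BIS tag) pair to (token, UPoS tag).

    Tags missing from the mapping are flagged with 'X' so they can be
    routed to manual linguistic validation rather than silently dropped.
    """
    return [(tok, mapping.get(tag, "X")) for tok, tag in tagged_sentence]
```

The two-level workflow in the paper would then correspond to running such an automatic pass first and having annotators review the output, recording agreements and disagreements.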
- Research Article
3
- 10.1007/s43681-024-00606-3
- Nov 27, 2024
- AI and Ethics
Natural Language Processing (NLP) research on AI Safety and social bias in AI has focused on safety for humans and social bias against human minorities. However, some AI ethicists have argued that the moral significance of nonhuman animals has been ignored in AI research. The purpose of this study is therefore to investigate whether there is speciesism, i.e., discrimination against nonhuman animals, in NLP research. First, we explain why nonhuman animals are relevant in NLP research. Next, we survey the findings of existing research on speciesism in NLP researchers, data, and models, and investigate this problem further in the present study. The findings suggest that speciesism exists in researchers, data, and models, respectively. Specifically, our survey and experiments show that (a) NLP researchers, even those who study social bias in AI, do not recognize speciesism or speciesist bias; (b) speciesist bias is inherent in the annotations of the datasets used to evaluate NLP models; and (c) OpenAI GPTs, recent NLP models, exhibit speciesist bias by default. Finally, we discuss how we can reduce speciesism in NLP research.
- Supplementary Content
21
- 10.3389/frai.2023.1225093
- Sep 25, 2023
- Frontiers in Artificial Intelligence
Recent advances in deep learning have improved the performance of many Natural Language Processing (NLP) tasks such as translation, question-answering, and text classification. However, this improvement comes at the expense of model explainability. Black-box models make it difficult to understand the internals of a system and the process it takes to arrive at an output. Numerical (LIME, Shapley) and visualization (saliency heatmap) explainability techniques are helpful; however, they are insufficient because they require specialized knowledge. These factors led rationalization to emerge as a more accessible explainability technique in NLP. Rationalization justifies a model's output by providing a natural language explanation (rationale). Recent improvements in natural language generation have made rationalization an attractive technique because it is intuitive, human-comprehensible, and accessible to non-technical users. Since rationalization is a relatively new field, the literature is disorganized. As the first survey of its kind, this work analyzes rationalization literature in NLP from 2007 to 2022. The survey presents available methods, explainability evaluations, code, and datasets used across various NLP tasks that use rationalization. Further, a new subfield in Explainable AI (XAI), namely Rational AI (RAI), is introduced to advance the current state of rationalization. A discussion on observed insights, challenges, and future directions is provided to point to promising research opportunities.
- Research Article
79
- 10.1145/3593042
- Jul 17, 2023
- ACM Computing Surveys
In the past few years, it has become increasingly evident that deep neural networks are not resilient enough to withstand adversarial perturbations in input data, leaving them vulnerable to attack. Various authors have proposed strong adversarial attacks for computer vision and Natural Language Processing (NLP) tasks. As a response, many defense mechanisms have also been proposed to prevent these networks from failing. The significance of defending neural networks against adversarial attacks lies in ensuring that the model’s predictions remain unchanged even if the input data is perturbed. Several methods for adversarial defense in NLP have been proposed, catering to different NLP tasks such as text classification, named entity recognition, and natural language inference. Some of these methods not only defend neural networks against adversarial attacks but also act as a regularization mechanism during training, saving the model from overfitting. This survey aims to review the various methods proposed for adversarial defenses in NLP over the past few years by introducing a novel taxonomy. The survey also highlights the fragility of advanced deep neural networks in NLP and the challenges involved in defending them.
- Book Chapter
1
- 10.1201/9781003144526-5
- Dec 20, 2021
Over the years, there has been a remarkable interchange between Big Data and Natural Language Processing (NLP), with insights from different computer science fields enhancing the development of theory, methodology, and resources. Due to the inherently multifaceted nature of natural languages, many natural language tasks are not well suited to mathematically defined algorithmic solutions. To avoid this issue, statistical machine learning (ML) approaches are used. The rise of Big Data enables a new paradigm for tackling NLP problems, handling the complexity of the problem space by harnessing the power of data to build high-quality models. This chapter gives an introduction to various core NLP tasks and highlights their data-driven solutions. A few representative NLP applications are described that are built using the core NLP tasks as the basic foundation. Various sources of Big Data for NLP research are examined, followed by Big Data-driven NLP research and applications. Finally, the chapter concludes by indicating trends and future research directions [1–9].
- Research Article
11
- 10.1145/3654795
- May 14, 2024
- ACM Computing Surveys
Figurative language generation (FLG) is the task of reformulating a given text to include a desired figure of speech, such as hyperbole, simile, and several others, while still being faithful to the original context. This is a fundamental, yet challenging task in Natural Language Processing (NLP), which has recently received increased attention due to the promising performance brought by pre-trained language models. Our survey provides a systematic overview of the development of FLG, mostly in English, starting with the description of some common figures of speech, their corresponding generation tasks, and datasets. We then focus on various modelling approaches and assessment strategies, leading us to discuss some challenges in this field and to suggest some potential directions for future research. To the best of our knowledge, this is the first survey that summarizes the progress of FLG, including the most recent development in NLP. We also organize corresponding resources, e.g., article lists and datasets, and make them accessible in an open repository. We hope this survey can help researchers in NLP and related fields to easily track the academic frontier, providing them with a landscape and a roadmap of this area.
- Research Article
1
- 10.5075/epfl-thesis-7148
- Jan 1, 2016
Word embedding is a feature learning technique which aims at mapping words from a vocabulary into vectors of real numbers in a low-dimensional space. By leveraging large corpora of unlabeled text, such continuous space representations can be computed for capturing both syntactic and semantic information about words. Word embeddings, when used as the underlying input representation, have been shown to be a great asset for a large variety of natural language processing (NLP) tasks. Recent techniques to obtain such word embeddings are mostly based on neural network language models (NNLM). In such systems, the word vectors are randomly initialized and then trained to predict optimally the contexts in which the corresponding words tend to appear. Because words occurring in similar contexts have, in general, similar meanings, their resulting word embeddings are semantically close after training. However, such architectures can be challenging and time-consuming to train. In this thesis, we focus on building simple models which are fast and efficient on large-scale datasets. As a result, we propose a model based on counts for computing word embeddings. A word co-occurrence probability matrix can easily be obtained by directly counting the context words surrounding the vocabulary words in a large corpus of texts. The computation can then be drastically simplified by performing a Hellinger PCA of this matrix. Besides being simple, fast, and intuitive, this method has two other advantages over NNLM. It first provides a framework to infer unseen words or phrases. Secondly, all embedding dimensions can be obtained after a single Hellinger PCA, while a new training is required for each new size with NNLM. We evaluate our word embeddings on classical word tagging tasks and show that we reach similar performance to that of neural-network-based word embeddings. While many techniques exist for computing word embeddings, vector space models for phrases remain a challenge.
Still based on the idea of proposing simple and practical tools for NLP, we introduce a novel model that jointly learns word embeddings and their summation. Sequences of words (i.e. phrases) with different sizes are thus embedded in the same semantic space by just averaging word embeddings. In contrast to previous methods, which reported some compositionality aspects a posteriori via simple summation, we simultaneously train words to sum, while keeping the maximum information from the original vectors. These word and phrase embeddings are then used in two different NLP tasks: document classification and sentence generation. Using such word embeddings as inputs, we show that good performance is achieved in sentiment classification of short and long text documents with a convolutional neural network. Finding good compact representations of text documents is crucial in classification systems. Based on the summation of word embeddings, we introduce a method to represent documents in a low-dimensional semantic space. This simple operation, along with a clustering method, provides an efficient framework for adding semantic information to documents, which yields better results than classical approaches for classification. Simple models for sentence generation can also be designed by leveraging such phrase embeddings. We propose a phrase-based model for image captioning which achieves similar results to those obtained with more complex models. Not only word and phrase embeddings but also embeddings for non-textual elements can be helpful for sentence generation. We therefore explore embedding table elements to generate better sentences from structured data. We experiment with this approach on a large-scale dataset of biographies, where biographical infoboxes were available. By parameterizing both words and fields as vectors (embeddings), we significantly outperform a classical model.
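The count-based pipeline described in this abstract (Hellinger PCA of a word co-occurrence matrix) can be sketched with NumPy. This is a minimal reading of the described method, not the thesis implementation; note how smaller embedding sizes fall out as prefixes of larger ones, one of the advantages claimed over NNLM training:

```python
import numpy as np

def hellinger_pca_embeddings(counts, dim):
    """Word embeddings from a word-by-context co-occurrence count matrix.

    Row-normalize counts to probabilities, apply the square root (the
    Hellinger map), center, and keep the top principal directions via SVD.
    Each row of the result is one word's embedding.
    """
    probs = counts / counts.sum(axis=1, keepdims=True)
    H = np.sqrt(probs)                        # Hellinger transform
    H = H - H.mean(axis=0, keepdims=True)     # center columns for PCA
    U, S, Vt = np.linalg.svd(H, full_matrices=False)
    return U[:, :dim] * S[:dim]               # project onto top components
```

Because the SVD is computed once, requesting a 2-dimensional embedding simply takes the first two columns of the 4-dimensional one, whereas an NNLM would need to be retrained from scratch for each embedding size.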
- Research Article
21
- 10.1080/0960085x.2020.1816145
- Sep 24, 2020
- European Journal of Information Systems
Natural Language Processing (NLP) is now widely integrated into web and mobile applications, enabling natural interactions between humans and computers. Although there is a large body of NLP studies published in Information Systems (IS), a comprehensive review of how NLP research is conceptualised and realised in the context of IS has not been conducted. To assess the current state of NLP research in IS, we use a variety of techniques to analyse a literature corpus comprising 356 NLP research articles published in IS journals between 2004 and 2018. Our analysis indicates the need to move from semantics to pragmatics. More importantly, our findings unpack the challenges and assumptions underlying current research trends in NLP. We argue that overcoming these challenges will require a renewed disciplinary IS focus. By proposing a roadmap of NLP research in IS, we draw attention to three NLP research perspectives and present future directions that IS researchers are uniquely positioned to address.
- Conference Article
62
- 10.1145/3411408.3411440
- Sep 2, 2020
Transformer-based language models, such as BERT and its variants, have achieved state-of-the-art performance in several downstream natural language processing (NLP) tasks on generic benchmark datasets (e.g., GLUE, SQUAD, RACE). However, these models have mostly been applied to the resource-rich English language. In this paper, we present GREEK-BERT, a monolingual BERT-based language model for modern Greek. We evaluate its performance in three NLP tasks, i.e., part-of-speech tagging, named entity recognition, and natural language inference, obtaining state-of-the-art performance. Interestingly, in two of the benchmarks GREEK-BERT outperforms two multilingual Transformer-based models (M-BERT, XLM-R), as well as shallower neural baselines operating on pre-trained word embeddings, by a large margin (5%-10%). Most importantly, we make both GREEK-BERT and our training code publicly available, along with code illustrating how GREEK-BERT can be fine-tuned for downstream NLP tasks. We expect these resources to boost NLP research and applications for modern Greek.
- Research Article
11
- 10.1080/07434619912331278795
- Jan 1, 1999
- Augmentative and Alternative Communication
Historically, there has been little research into the use of natural language processing (NLP) within the context of electronic augmentative and alternative communication (AAC) systems. This is despite the fact that key aspects of AAC research are concerned with the treatment of natural language, and that communication aids appear to represent an ideal means of applying advanced NLP techniques. The lack of NLP research in relation to AAC is partially due to the tendency to focus NLP activities on solving particular problems from constructed examples, rather than the treatment of unrestricted language. Today, however, the face of NLP research has changed significantly, thanks to the increasing availability of and need to process larger corpora. This has prompted a quest for robust solutions to treat unrestricted text, which, in turn, has had two key results: (a) an influx of statistical techniques and (b) the emergence of comprehensive, language-related resources such as broad coverage electronic dictionaries. This paper describes current AAC research that uses NLP and comments on future research directions. Included is a brief survey of AAC systems and research prototypes involving NLP techniques, which is followed by an overview of resources emerging from NLP research that may be applicable to AAC.