Court, Judges and the Pandemic: Computational Legal Insights from the Ontario Court of Appeal Corpus 2008-2021
Appellate courts occupy a unique position. They are the final instance for most litigants, guiding lower courts, but they are also a gateway to the Supreme Court. This dual role calls for special scrutiny and analysis. Yet data on and analysis of appeal courts remain scarce, especially compared to apex courts. This article fills part of this gap for the Ontario Court of Appeal. It introduces a new dataset of its decisions between 2008 and 2021 consisting of both metadata, such as outcomes per decision, and the decision full text, which can be mined through natural language processing techniques. Aside from presenting the dataset, the paper uses novel data science approaches to trace the practice of the Court over time, to dissect the decision patterns of its judges, and to assess how the pandemic shock impacted the Court. It finds, among other things, that the Court has been stable in its decision patterns, but that decisions have grown longer; it also shows that some judges render harsher decisions than others, and it illustrates how the pandemic created instant precedent. We hope that the new dataset and corpus will spur further research on the Ontario Court of Appeal.
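As a rough illustration of the kind of analysis the abstract describes (tracking decision length and outcomes over time), the following sketch computes mean decision length and the share of allowed appeals per year. The record layout, the outcome labels, and the three sample decisions are hypothetical and are not taken from the corpus; whitespace token counts stand in for whatever length measure the authors used.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (year, outcome, decision text). Not real corpus data.
decisions = [
    (2008, "dismissed", "The appeal is dismissed."),
    (2008, "allowed", "The appeal is allowed and a new trial is ordered."),
    (2021, "dismissed", "For the reasons that follow, the appeal is dismissed with costs."),
]

def mean_length_by_year(records):
    """Return {year: mean number of whitespace-separated tokens per decision}."""
    lengths = defaultdict(list)
    for year, _outcome, text in records:
        lengths[year].append(len(text.split()))
    return {year: mean(counts) for year, counts in sorted(lengths.items())}

def allowed_rate_by_year(records):
    """Return {year: fraction of decisions whose outcome is 'allowed'}."""
    outcomes = defaultdict(list)
    for year, outcome, _text in records:
        outcomes[year].append(outcome == "allowed")
    return {year: sum(flags) / len(flags) for year, flags in sorted(outcomes.items())}

print(mean_length_by_year(decisions))
print(allowed_rate_by_year(decisions))
```

Grouping by judge instead of by year would follow the same pattern, with the judge name as the dictionary key.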
- Conference Article
24
- 10.1145/3018896.3036375
- Mar 22, 2017
Natural Language Processing (NLP) techniques show promising results in organizing and identifying desired information from bulky raw data. As a result, NLP techniques are increasingly attracting researchers' attention as a means of automating various software development activities, such as test case generation. However, selecting the right NLP techniques and tools to generate automated test cases is always challenging. Therefore, in this paper, we investigate the application of NLP techniques to generate test cases from preliminary requirements documents. A Systematic Literature Review (SLR) has been conducted to identify 16 research works published during 2005-2014. Consequently, 6 NLP techniques and 18 tools have been identified. Furthermore, 4 test case generation approaches and 9 NLP algorithms have also been presented. The identified NLP techniques and tools are highly beneficial for the researchers and practitioners of the domain.
- Research Article
- 10.52783/jisem.v10i19s.3009
- Mar 12, 2025
- Journal of Information Systems Engineering and Management
Our methodology utilizes a supervised learning approach, employing Random Forest and Gradient Boosting Machines (GBM) trained on a comprehensive dataset that includes email headers, content, and sender behavior. This approach allows our models to discern complex patterns associated with phishing attempts, achieving a 92% detection rate, a substantial improvement over the traditional signature-based methods' 65% rate. Additionally, we integrated NLP techniques, specifically Word2Vec and GloVe, to extract semantic features from email content, enhancing our system's ability to identify malicious intent. The incorporation of NLP not only improves the precision of phishing detection by an additional 15% compared to conventional methods but also emphasizes the importance of semantic analysis in cybersecurity. This enhancement is crucial for understanding the subtle cues within email content that may indicate phishing, offering a more robust and effective defense mechanism for rural areas. By combining supervised learning with quantum computing and NLP, our approach addresses the significant gaps in traditional cybersecurity methods. This multi-layered strategy ensures a more reliable and efficient way to safeguard rural communities from the increasing threat of cyber attacks. The advanced AI techniques employed here leverage both the predictive power of machine learning and the nuanced understanding of language provided by NLP, setting a new standard in cybersecurity practices. The results of our study highlight the effectiveness of the proposed methodology, demonstrating a potential to markedly improve cybersecurity in resource-constrained rural environments. With a 92% phishing detection rate and an increase in precision through the use of NLP, our approach promises a significant advancement in the protection against cyber threats for rural areas, offering a comprehensive and scalable solution. 
This research presents an innovative multi-layered AI approach, utilizing quantum computing to enhance cybersecurity in rural areas vulnerable to phishing threats. The paper details the integration of sophisticated machine learning techniques, Random Forest and Gradient Boosting Machines (GBM), with Natural Language Processing (NLP) tools like Word2Vec and GloVe, achieving significant improvements in phishing detection rates. Through a comprehensive analysis of existing cybersecurity strategies and the limitations of traditional signature-based detection methods, this study proposes a robust solution tailored for rural settings such as Siddlagatta, Chikkaballapur, and Devanahalli. By incorporating quantum computing, the approach not only overcomes the constraints of classical computing but also leverages the predictive prowess of AI to offer a more reliable and effective defense against cyber threats. The results demonstrate a promising increase in detection rates, underscoring the potential of this quantum-enhanced, AI-driven strategy to significantly bolster cybersecurity in resource-limited rural environments. Introduction : Cybersecurity in rural areas remains a pivotal concern, exacerbated by limited access to sophisticated technological resources and infrastructure. This paper introduces an advanced multi-layered artificial intelligence (AI) approach, utilizing quantum computing to enhance phishing threat detection in rural environments. Focusing on regions like Siddlagatta, Chikkaballapur, and Devanahalli, the study integrates supervised learning algorithms, Random Forest and Gradient Boosting Machines (GBM), with Natural Language Processing (NLP) techniques to improve the detection and analysis of phishing attempts. By leveraging machine learning to surpass traditional signature-based methods, this approach significantly boosts detection rates, presenting a tailored, effective solution to protect these vulnerable communities against evolving cyber threats.
Objectives : The objectives of this research are to develop and implement a multi-layered artificial intelligence (AI) approach, utilizing quantum computing to enhance the detection of phishing threats in rural areas. Specifically, the study aims to address the limitations of traditional signature-based detection methods by integrating advanced machine learning algorithms such as Random Forest and Gradient Boosting Machines (GBM) with Natural Language Processing (NLP) techniques. This integration seeks to improve the precision of identifying malicious intent in email communications by analyzing semantic features. The research also explores the effectiveness of these AI techniques in rural settings where cybersecurity resources are scarce, aiming to provide a more robust and efficient solution that can significantly reduce the incidence of phishing attacks in these vulnerable communities. Methods : The proposed methodology entails the development of a web-based platform that melds social networking functionalities with sophisticated agricultural tools and services. By utilizing user profiles, the system effectively categorizes key stakeholders such as farmers, suppliers, experts, and policymakers to foster focused engagement and collaborative efforts. The integration of data from IoT sensors, satellite imagery, and user contributions is channeled into a central system that supports real-time analysis and informed decision-making. Moreover, the platform employs algorithms designed to align stakeholders with pertinent resources, market possibilities, and professional advice. Enhanced communication features like forums, direct messaging, and video conferencing are incorporated to promote interactive exchanges among users. A pilot phase involving select agricultural communities will be initiated to evaluate the practicality and impact of the framework, with subsequent adjustments driven by user feedback and analytic assessments. 
The ultimate goal of this framework is to boost connectivity, facilitate the efficient distribution of resources, and empower all involved parties through a scalable and intuitive interface. This approach not only aims to revolutionize the way agricultural communities interact and operate but also seeks to provide a robust foundation for continuous growth and innovation in the sector. Results : The simulated results of the study demonstrate a significant enhancement in phishing detection capabilities through the integration of a multi-layered AI approach in rural settings. The deployment of advanced machine learning algorithms, such as Random Forest and Gradient Boosting Machines (GBM), along with Natural Language Processing (NLP) techniques, notably increased the phishing detection rate to 92%, a substantial improvement over the 65% detection rate achieved by traditional signature-based methods. Additionally, the incorporation of NLP through tools like Word2Vec and GloVe improved the precision of identifying malicious intent by an additional 15%, emphasizing the effectiveness of semantic analysis in distinguishing phishing attempts. These results highlight the potential of combining machine learning and quantum computing to address the unique cybersecurity challenges faced in rural areas, providing a robust solution that significantly enhances the detection and prevention of phishing threats. Conclusions : The research presented in this paper successfully demonstrates the efficacy of a multi-layered AI approach in significantly enhancing cybersecurity against phishing threats in rural areas. By integrating advanced machine learning algorithms with Natural Language Processing techniques and quantum computing, the study achieved a notable increase in phishing detection rates, outperforming traditional signature-based methods with a detection rate of 92%.
This approach not only addresses the limitations inherent in existing cybersecurity measures but also tailors its strategy to the unique challenges posed by the limited resources and infrastructure in rural environments. The integration of semantic analysis through NLP further enhanced the precision of threat detection, providing a more nuanced understanding of malicious intent. Overall, the study underscores the potential of sophisticated AI technologies to transform cybersecurity practices in underserved areas, ensuring more effective protection against evolving cyber threats.
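The abstract reports a "detection rate" and a "precision" without defining either. A minimal sketch of the usual definitions follows, assuming that "detection rate" means recall (the share of phishing emails that are flagged) and that precision means the share of flagged emails that really are phishing. The confusion-matrix counts are hypothetical, chosen only so that recall comes out at 92%; they are not the study's data.

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute recall (detection rate), precision and accuracy from
    confusion-matrix counts for a binary phishing classifier."""
    return {
        "recall": tp / (tp + fn),        # flagged phishing / all phishing
        "precision": tp / (tp + fp),     # flagged phishing / all flagged
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical evaluation set: 100 phishing emails, 900 legitimate emails.
metrics = detection_metrics(tp=92, fp=23, fn=8, tn=877)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts the classifier catches 92 of 100 phishing emails yet only 80% of its alerts are correct, which is why the two figures should be reported and compared separately.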
- Supplementary Content
- 10.2196/72853
- Aug 14, 2025
- Journal of Medical Internet Research
Background: Unstructured patient feedback (UPF) allows patients to freely express their experiences without the constraints of predefined questions. The proliferation of online health care rating websites has created a vast source of UPF. Natural language processing (NLP) techniques, particularly sentiment analysis and topic modeling, are increasingly being used to analyze UPF in health care settings; however, the scope and clinical relevance of these technologies are unclear. Objective: This scoping review investigates how NLP techniques are being used to interpret UPF, with a focus on the health care settings in which this is used, the purposes for using these technologies, and any impacts reported on clinical practice. Methods: Searches of MEDLINE, Embase, CINAHL, Cochrane Database of Reviews, and Google Scholar were conducted in February 2024. No date limits were applied. Eligibility criteria included English-language studies that used NLP techniques on UPF that pertained to an identifiable health care setting or providers. Studies were excluded if human actors solely performed coding or if NLP was applied to structured feedback or non–patient-generated content. Data were extracted and narratively synthesized regarding health care settings, NLP methods, and clinical applications. Results: From 4017 records, 52 studies met inclusion criteria. NLP was most commonly applied to UPF from secondary care settings (n=33) with fewer in primary (n=10) or community (n=5) care. Three NLP techniques were identified in the included studies: sentiment analysis (n=32), topic modeling (n=15), and text classification (n=7). Sentiment analysis was applied to explore associations between patient sentiment and health care provider characteristics, track emotional responses over time, and identify areas for improvement in health care delivery.
Topic modeling, primarily using the latent Dirichlet allocation algorithm, was used to uncover latent themes in patient feedback, compare patient experiences across different health care settings, and track changes in patient concerns over time. Text classification was used to categorize patient feedback into predefined topics. The association between NLP-derived insights and traditional health care quality metrics was limited, with few studies describing concrete clinical impacts resulting from their analyses. Conclusions: NLP has been applied to UPF across a number of contexts, primarily to identify features of health services or professionals that support good patient experience. The growth of research publications demonstrates an academic interest in these technologies, but there is little evidence these approaches are being used in clinical settings. Future research is required to assess how NLP may capture the nuance of health care interactions, align with existing quality metrics, and how it may be used to influence clinician behavior.
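To make concrete what sentiment analysis of free-text feedback involves, here is a deliberately simple lexicon-based scorer. It is a stand-in for illustration only: the reviewed studies are not stated to use this method, and the word lists and example comments are hypothetical.

```python
import re

# Tiny hypothetical sentiment lexicons; real studies use far larger resources.
POSITIVE = {"kind", "helpful", "excellent", "clean", "friendly"}
NEGATIVE = {"rude", "dirty", "slow", "painful", "ignored"}

def sentiment_score(comment):
    """Return a score in [-1, 1]: (positive - negative) / matched words.
    Comments with no lexicon words score 0.0 (treated as neutral)."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched

comments = [
    "The nurses were kind and helpful.",
    "Reception was rude and the ward was dirty, but the doctor was excellent.",
    "Parking is near the entrance.",
]
for comment in comments:
    print(f"{sentiment_score(comment):+.2f}  {comment}")
```

A scorer like this ignores negation and context ("not helpful" counts as positive), which is one concrete form of the nuance the review says still needs to be assessed.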
- Preprint Article
- 10.2196/preprints.72853
- Feb 20, 2025
BACKGROUND Unstructured patient feedback (UPF) allows patients to freely express their experiences without the constraints of predefined questions. The proliferation of online healthcare rating websites has created a vast source of UPF. Natural language processing (NLP) techniques, particularly sentiment analysis and topic modelling, are increasingly being used to analyse UPF in healthcare settings; however, the scope and clinical relevance of these technologies are unclear. OBJECTIVE This scoping review investigates how NLP techniques are being used to interpret UPF, with a focus on the healthcare settings in which this is used, the purposes for using these technologies, and any impacts reported on clinical practice. METHODS Searches of MEDLINE, EMBASE, CINAHL, Cochrane Database of Reviews, and Google Scholar were conducted in February 2024. No date limits were applied. English language studies that used NLP techniques on UPF that pertained to an identifiable health care setting or provider were included. Data extraction focused on the healthcare setting, NLP methods used, and applications of these techniques. RESULTS 52 studies were included. NLP was most commonly applied to UPF from secondary care settings (n=33) with fewer in primary (n=10) or community (n=5) care. Three NLP techniques were identified in the included studies: sentiment analysis (n=32), topic modelling (n=15) and text classification (n=7). Sentiment analysis was applied to explore associations between patient sentiment and healthcare provider characteristics, track emotional responses over time, and identify areas for improvement in healthcare delivery. Topic modelling, primarily using the Latent Dirichlet Allocation (LDA) algorithm, was employed to uncover latent themes in patient feedback, compare patient experiences across different healthcare settings, and track changes in patient concerns over time. Text classification was used to categorize patient feedback into predefined topics.
The association between NLP-derived insights and traditional healthcare quality metrics was limited, with few studies describing concrete clinical impacts resulting from their analyses. CONCLUSIONS NLP has been applied to UPF across a number of contexts, primarily to identify features of health services or professionals that support good patient experience. The growth of research publications demonstrates an academic interest in these technologies, but there is little evidence these approaches are being employed in clinical settings. Future research is required to assess how NLP may capture the nuance of healthcare interactions, align with existing quality metrics and how it may be used to influence clinician behaviour.
- Book Chapter
2
- 10.1201/9781003132110-7
- Feb 4, 2022
The importance and use of natural language processing (NLP) have grown considerably in the medical domain, where it is applied to clinical data for clinical studies and clinical trials, and these trials have led to many advances. Generally, NLP techniques were designed to support word- and sentence-based searches that return the best results for the search criteria, for example, using keywords such as disease names, medicine names, or the side effects of a particular drug, or suggesting a drug based on a person’s symptoms. Electronic health records (EHR) play a major role in storing patients’ medical records over time as they visit various doctors. The main advantage of EHR is that the history of the health records can be tracked very easily. Based on NLP and EHR techniques, general notes and suggestions are given to the doctor to simplify the task, and this keyword search technique provides many advantages, such as reducing the time needed for disease identification, helping doctors make the correct decision, and freeing up time for more patients. Even though NLP performs these numerous tasks, there are challenges in using it in the medical domain where it needs to improve. For EHR, many technical challenges have to be overcome, such as resistance, performance, and effectiveness in generating results. In this chapter we present a complete survey of NLP with its limitations and also show how NLP is delivering efficient results in the medical domain.
- Research Article
16
- 10.1111/lang.12243
- Jun 1, 2017
- Language Learning
Language acquisition occupies a central place in the study of human cognition, and research on how we learn language can be found across many disciplines, from developmental psychology and linguistics to education, philosophy, and neuroscience. It is a very challenging topic to investigate given that the learning target in first and second language acquisition is highly complex, and part of the challenge consists in identifying how different domains of language are acquired to form a fully functioning system of usage (Ellis, 2017). Correspondingly, the evidence about language use and language learning is generally shaped by many factors, including the characteristics of the task in which the language is produced (Alexopoulou, Michel, Murakami, & Meurers, 2017). The challenge is further complicated by the fact that language acquisition is affected by individual learner characteristics. Individual differences are particularly well studied for second language acquisition, where it is clear that factors such as native language, type of instruction, and motivation affect learning rate and ultimate attainment (Ushioda & Dörnyei, 2012; Williams, 2012). But recent research indicates that there is also considerable individual variation in child language development (see Rowland, 2013). To develop an understanding of language acquisition, we need to take into account these individual differences (MacWhinney, 2017). Despite these and other challenges, the past decades have witnessed significant progress in our understanding of how children and adults learn languages. The conceptual and empirical progress arguably is fueled by an increasing range of methods and approaches that are being used to study language acquisition (see Hoff, 2011; Mackey & Gass, 2012). 
For example, experimental approaches using artificial or natural languages have made it possible to investigate how changes across exposure conditions such as input frequency, instruction type, or prior knowledge affect learning in rigorously controlled environments. Learner corpora are growing in size and task types covered, with increasingly rich annotation supporting detailed analyses employing sophisticated statistical methods. Digital learning environments integrating computational methods hold the promise of supporting the systematic exploration of learning mechanisms in authentic teaching and learning, providing new sources of evidence on the roles played by the linguistic environment, interaction, and feedback in learning. The investigation of a complex phenomenon like language acquisition can significantly benefit from insights, tools, and methods from many disciplines, yet it is still relatively rare to find studies that combine multiple approaches. The research described in Monaghan and Mattock (2012), Ellis, Römer, and O'Donnell (2016), and Christiansen and Chater (2016) transparently illustrates the potential of multimethod approaches to language. For example, Monaghan and Mattock's (2012) investigation of word learning is an excellent illustration of how corpus research can connect with experimental research. Monaghan and Mattock first conducted corpus analyses of child-directed speech. They then used the information derived from these analyses to construct an artificial language that is based on natural language statistics. On this basis, they investigated the acquisition of nouns and verbs by adult learners in an artificial language experiment. While artificial language research is occasionally criticized for its limited ecological validity, the use of distributional information from natural language corpora in the artificial language construction mitigates some of this criticism (see also Monaghan & Rowland, 2017). 
Another impressive example of multimethod research is Ellis et al. (2016), who investigate the acquisition, processing, and use of verb-argument constructions (VACs), and their monograph contains a series of behavioral experiments, large-scale corpus analyses supported by natural language processing (NLP) techniques, and several computational simulations (connectionist and agent based). The result of this systematic multimethod exploration is a significant, in-depth understanding of how we learn, process, and use VACs, and a research model for others to follow suit. Finally, Christiansen and Chater's (2016) theoretical framework for understanding language acquisition, evolution, and processing is the direct result of multimethod research and would not be possible without the insights the authors gained from working at the intersection of experimental, computational, and corpus-based approaches for more than two decades. The question of how to promote multidisciplinary research across methodological boundaries has been central to the work of the three editors of this volume. A series of review articles aiming to connect research areas and introduce methodologies exemplify this (e.g., Meurers, 2012, 2015; Meurers & Dickinson, 2017; Rebuschat, 2013). One of the editors, Tony McEnery, directs the ESRC Centre for Corpus Approaches to Social Sciences (CASS, http://cass.lancs.ac.uk) at Lancaster University, whose primary objective is to enable colleagues in other, nonlinguistic disciplines to utilize the corpus approach. The two other editors are part of Tübingen's unique LEAD Graduate School & Research Network, which brings together over 130 scientists from education, psychology, linguistics, neuroscience, informatics, sociology, and economics to investigate learning and educational achievement. The LEAD initiative includes an interdisciplinary research and training program for doctoral students and postdocs, which is funded by Germany's Excellence Initiative.
In the same spirit, we have enjoyed organizing numerous symposia, workshops, summer schools, and conferences, and we have edited several books and special journal issues with the specific aim of bringing together leading researchers from different disciplines whose paths would normally not cross (e.g., Andringa & Rebuschat, 2015; Meurers, 2009; Rebuschat, 2015; Rebuschat, Rohrmeier, Hawkins, & Cross, 2012; Rebuschat & Williams, 2012). This special issue is part of this ongoing effort. This special issue was inspired by a symposium on “Connecting Data and Theory: Corpora and Second Language Research,” which was organized by the editors and took place in Lancaster, UK, on July 19, 2015. The symposium was jointly funded by the Language Learning Roundtable Grant Program and by CASS. The objective was to establish a dialogue between experts on second language acquisition, corpora, and computational analysis methods. This dialogue can significantly enrich the empirical basis of second language research but, to date, collaborations across these fields are still rare. The symposium aimed at directly addressing this shortcoming. There were three sessions, each approaching the symposium topic from a distinct research area. Nick Ellis and Brian MacWhinney provided the view from cognitive psychology, Detmar Meurers and Markus Dickinson the view from computational linguistics, and Anke Lüdeling and Sylviane Granger the view from corpus linguistics. The symposium concluded with a general discussion. The discussion and feedback were both very positive and lively, and when the opportunity arose to produce a special issue on “Currents in Language Learning,” we readily agreed to do so. Five presentations of the symposium provided the basis for four expanded and updated articles (Ellis; Lüdeling et al.; MacWhinney; Meurers & Dickinson). 
Additional chapters were written by colleagues who attended the symposium and made thoughtful contributions (Alexopoulou et al.; Gablasova et al.; Monaghan & Rowland; Ziegler et al.). Based on the symposium discussions, we decided to expand the scope for the special issue in two areas. We solicited an article that would contribute a language testing angle (Wisniewski) and broadened the topic to language learning in general, given the long and fruitful tradition of using corpora, NLP tools, and computational modeling in child language research. As a result, the third issue of the “Currents in Language Learning” series brings together leading researchers in cognitive psychology, computational linguistics, corpus linguistics, developmental psychology, and linguistics. Our contributors were asked to (i) discuss recent work and trends, (ii) outline opportunities and challenges of combining multiple approaches, and (iii) propose directions for future research at the intersection of experimental, computational, and corpus-based approaches to language learning. Each submission was peer reviewed by several anonymous reviewers and by the editors. In the first article, Padraic Monaghan and Caroline Rowland describe the challenges of combining experimental, computational, and corpus approaches to research in child language acquisition. Their article clearly articulates the benefits of multidisciplinary approaches by providing three examples for a successful combination of methods (grammatical category acquisition, morphological development, and the acquisition of sentence structure). On this basis, they conclude with a discussion of future directions. In the second article, Nick Ellis approaches the topic from the perspective of usage-based linguistics. Ellis clearly illustrates the essential contributions made by experimental, computational, and corpus-based research to the establishment of usage-based theories of language (see also Ellis, Römer, & O'Donnell, 2016).
In the next article, Detmar Meurers and Markus Dickinson provide a comprehensive review of how computational linguistics and NLP techniques can contribute to our understanding of second language learning. They focus on two contributions: First, computational linguistics can enrich the options for obtaining substantial amounts of data for language learning research, including data obtained via intelligent computer-assisted language learning (ICALL) interfaces (see also Ziegler et al., 2017). Second, NLP techniques can support the identification and interpretation of data of relevance to second language research via automatic linguistic annotation of large-scale corpora—which they argue requires more cross-disciplinary discussion to operationalize relevant learner language distinctions and develop annotation schemes that are adequate to support second language research. The next three articles focus on essential methodological considerations arising from corpus-based language learning research. Anke Lüdeling, Hagen Hirschmann, and Anna Shadrova illustrate how learner corpus data can be used to investigate acquisition patterns by concentrating on second language morphological productivity as a test case. They raise methodological points regarding linguistic modeling, the formation of target hypotheses, and error annotation. Dana Gablasova, Vaclav Brezina, and Tony McEnery focus on collocations in language learning research. The interest in formulaic language has been growing in both first and second language research, and there is now a considerable number of experimental and corpus-based studies in this area (e.g., Christiansen & Arnon, in press). Gablasova et al. critically review measures of association that are frequently used to identify collocations (t score, MI score, Log Dice) and discuss how a better understanding of these measures greatly facilitates the interpretation of trends in language production data. 
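For readers unfamiliar with the three association measures named above, the following sketch computes them from raw corpus counts using their commonly cited definitions: MI = log2(O/E), t = (O − E)/√O, and logDice = 14 + log2(2·O/(f1 + f2)), where O is the observed co-occurrence frequency, E = f1·f2/N is the expected frequency, and N is the corpus size. The counts are hypothetical, and the exact variants discussed by Gablasova et al. (for example, window-size corrections) may differ.

```python
import math

def association_scores(observed, freq_node, freq_collocate, corpus_size):
    """Return MI, t-score and logDice for one node/collocate pair.

    observed       -- co-occurrence frequency of the pair (O)
    freq_node      -- corpus frequency of the node word (f1)
    freq_collocate -- corpus frequency of the collocate (f2)
    corpus_size    -- total number of tokens in the corpus (N)
    """
    expected = freq_node * freq_collocate / corpus_size
    return {
        "mi": math.log2(observed / expected),
        "t": (observed - expected) / math.sqrt(observed),
        "log_dice": 14 + math.log2(2 * observed / (freq_node + freq_collocate)),
    }

# Hypothetical counts for a single word pair in a one-million-token corpus.
scores = association_scores(observed=100, freq_node=1000,
                            freq_collocate=1000, corpus_size=1_000_000)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Note that logDice does not use the corpus size at all, whereas MI and the t-score both depend on it through the expected frequency; this is one reason scores from corpora of different sizes are not directly comparable across measures.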
In the sixth article, the same authors focus on the role of corpus-based frequency information for advancing our understanding of how languages are learned. They illustrate the issues involved in the interpretation and comparison of corpus frequencies by contrasting several first and second language corpora. The next two articles provide concrete examples of the benefits of working at the intersection of experimental, computational, and corpus-based approaches to language learning. Dora Alexopoulou, Marije Michel, Akira Murakami, and Detmar Meurers test hypotheses derived from instructed second language acquisition research and task-based language teaching by applying techniques from computational linguistics to a very large learner corpus. They analyze the texts in the EF-Cambridge Open Language Database (https://corpus.mml.cam.ac.uk/efcamdat), a learner corpus that contains over 70,000,000 words collected through an online language learning platform. Their article demonstrates how large corpora and NLP techniques can contribute to contemporary language learning research by complementing experimental evidence. Nicole Ziegler, Detmar Meurers, Patrick Rebuschat, Simón Ruiz, José L. Moreno-Vega, Maria Chinkina, Wenjing Li, and Sarah Grey combine theoretical and methodological insights from second language acquisition, NLP, and ICALL research to investigate the effectiveness of input enhancement in promoting second language development. Their study is experimental, but data are collected via a Web-based ICALL system (WERTi, http://purl.org/icall/werti) that provides computerized pedagogical treatment of learner-selected texts and automatically tracks and collects learners’ action and engagement with the input. This results in a particularly rich data set, beyond what is typically available via traditional experimental approaches. 
In the next article, Katrin Wisniewski provides a conceptual review of how learner corpora can contribute to language testing research, emphasizing the importance of empirical scale validity. Wisniewski focuses on the Common European Framework of Reference, the most common European reference tool to describe levels of foreign language proficiency, and explicitly works out the opportunities and challenges of working across disciplinary and methodological boundaries. The issue concludes with an important call for the construction of a shared platform to study second language acquisition. Brian MacWhinney argues that further advancement of second language acquisition theory and practice requires a combination of experimental data, a better understanding of how individual differences impact learning, and corpus data that permit the investigation of acquisition patterns. The proposed platform would facilitate this by enabling the collection of substantial amounts of learner data online and by establishing a common protocol on how to share the data—in line with the Child Language Data Exchange System, the central repository for child language data that contributed greatly to our understanding of how children learn language (see Monaghan & Rowland, 2017). The success of such an approach rests on researchers across the world sharing data and agreeing on common protocols for adding and retrieving data. The special issue, and the symposium on which it was based, would not have been possible without the essential support and contributions of many colleagues. We are grateful to our symposium presenters and delegates for making it such a successful event, and we thank our authors for submitting excellent manuscripts for this special issue. We are indebted to the anonymous peer reviewers, who thoroughly assessed the texts and provided very valuable feedback, especially on how to make the contributions accessible and relevant across disciplines. 
At Language Learning, we are particularly grateful to Nick Ellis (General Editor) and Pavel Trofimovich (Journal Editor) for their sustained support throughout this project, and to Izzat Ibrahim for his friendly assistance in the production of this special issue. At Lancaster and Tübingen, we are very grateful to Lisa Becker and Abi Hawtin for their help in copyediting the volume and to Katarina Pardula for her support in organizing the symposium. Finally, we would like to gratefully acknowledge the financial support of the ESRC Centre for Corpus Approaches to Social Science and Language Learning's Roundtable Grant Program, without which neither the symposium nor the special issue would have been possible.
- Book Chapter
2
- 10.4018/979-8-3693-2165-2.ch002
- Apr 19, 2024
The AI voice assistant mobile application was developed to help drivers operate their mobile phones while driving without touching them. The literature review examines multiple innovative artificial intelligence technologies used in voice assistant applications that rely on natural language processing (NLP) techniques. The methodology was qualitative, and the design science paradigm was used to develop the voice assistant for smartphones with NLP techniques. The NLP techniques applied in the development of the AI voice assistant are smart synthesis, data flow sequence, core and interface accessing, part-of-speech tagging, named entity recognition, coreference resolution, and Porter stemming. Operations supported by the application include performing arithmetic calculations from voice commands and returning the computed result via voice, searching the internet based on user voice input, and providing a response via voice assistance.
- Research Article
- 10.17862/cranfield.rd.10066229.v1
- Nov 19, 2019
As machine learning becomes more common in defence and security, there is a real risk that the low accessibility of techniques to non-specialists will hinder the process of operationalising the technologies. This poster will present a tool to support a variety of Natural Language Processing (NLP) techniques, including the management of corpora (data sets of documents used for NLP tasks), creating and training models, and visualising the output of the models. The aim of this tool is to allow non-specialists to exploit complex NLP techniques to understand the content of large volumes of reports. NLP techniques are the mechanisms by which a machine can process and analyse text written by humans. These methods can be used for a range of tasks including categorising documents, translation and summarising text. For many of these tasks the ability to process and analyse large corpora of text is key. With current methods, the ability to manage corpora is rarely considered, instead relying on researchers and practitioners to do this manually in their file system. To train models, researchers use ad-hoc code directly, writing scripts or code and compiling or running them through an interpreter. These approaches can be a challenge when working in multidisciplinary fields, such as defence and security and cyber security. This is even more salient when delivering research where outputs may be operationalised and the accessibility can be a limiting factor in their deployment and use. We present a web interface that uses an asynchronous service-based architecture to enable non-specialists to easily manage multiple large corpora and create and operationalise a variety of different models; at this early stage we have focussed on one NLP technique, that of topic models. This tool support has been created as part of a project considering the use of NLP to better understand reports of insider threat attacks.
These are security incidents where the attacker is a member of staff or another trusted individual. Insider threat attacks are particularly difficult to defend against due to the level of access these individuals gain during the regular course of their employment. The wider use of these techniques would generate greater impact both tactically in defending against these attacks and strategically in developing policy and procedures. Tools are available; however, they are often complex and perform a single task, limiting their use. To generate maximum impact from our research we have developed this web-based software to make the tools more accessible, especially to non-specialist researchers, customers and potential users.
- Research Article
90
- 10.1109/access.2021.3070606
- Jan 1, 2021
- IEEE Access
Context: User stories have been widely accepted as artifacts to capture the user requirements in agile software development. They are short pieces of text in a semi-structured format that express requirements. Natural language processing (NLP) techniques offer a potential advantage in user story applications. Objective: Conduct a systematic literature review to capture the current state of the art of NLP research on user stories. Method: The search strategy is used to obtain relevant papers from SCOPUS, ScienceDirect, IEEE Xplore, ACM Digital Library, SpringerLink, and Google Scholar. Inclusion and exclusion criteria are applied to filter the search results. We also use the forward and backward snowballing techniques to obtain more comprehensive results. Results: The search identified 718 papers published between January 2009 and December 2020. After applying the inclusion/exclusion criteria and the snowballing technique, we identified 38 primary studies that discuss NLP techniques in user stories. Most studies used NLP techniques to extract aspects of who, what, and why from user stories. The purpose of NLP studies in user stories is broad, ranging from discovering defects, generating software artifacts, and identifying the key abstraction of user stories to tracing links between models and user stories. Conclusion: NLP can help system analysts manage user stories. Implementing NLP in user stories has many opportunities and challenges.
Exploring NLP techniques and applying rigorous evaluation methods are required to obtain quality research. As with NLP research in general, the ability to understand a sentence’s context continues to be a challenge.
- Research Article
31
- 10.1016/j.ijmedinf.2022.104779
- Apr 26, 2022
- International journal of medical informatics
Applications of natural language processing in radiology: A systematic review
- Conference Article
- 10.1109/argencon.2014.6868539
- Jun 1, 2014
The inspection of documents written in natural language with computers has become feasible thanks to advances in Natural Language Processing (NLP) techniques. However, certain applications require a deeper semantic analysis of the text to produce good results. In this article, we present an exploratory study of semantic-aware NLP techniques for discovering latent concerns in use case specifications. For this purpose, we propose two NLP techniques, namely semantic clustering and semantically-enriched rules. After evaluating these two techniques and comparing them with a technique developed by other researchers, the results showed that semantic NLP techniques hold great potential for detecting candidate concerns. In particular, if these techniques are properly configured, they can help to reduce the effort of requirements analysts and promote better quality in software development.
- Research Article
117
- 10.1109/access.2022.3183083
- Jan 1, 2022
- IEEE Access
Every year, phishing results in losses of billions of dollars and is a major threat to the Internet economy. Phishing attacks are now most often carried out by email. To better comprehend the existing research trend of phishing email detection, several review studies have been performed. However, it is important to assess this issue from different perspectives. None of the surveys has comprehensively studied the use of Natural Language Processing (NLP) techniques for detection of phishing, except one that shed light on the use of NLP techniques for classification and training purposes while exploring a few alternatives. To bridge the gap, this study aims to systematically review and synthesise research on the use of NLP for detecting phishing emails. Based on specific predefined criteria, a total of 100 research articles published between 2006 and 2022 were identified and analysed. We study the key research areas in phishing email detection using NLP, machine learning algorithms used in phishing email detection, text features in phishing emails, datasets and resources that have been used in phishing emails, and the evaluation criteria. The findings include that the main research area in phishing detection studies is feature extraction and selection, followed by methods for classifying and optimizing the detection of phishing emails. Amongst the range of classification algorithms, support vector machines (SVMs) are heavily utilised for detecting phishing emails. The most frequently used NLP techniques are found to be TF-IDF and word embeddings. Furthermore, the most commonly used dataset for benchmarking phishing email detection methods is the Nazario phishing corpus. Also, Python is the most commonly used programming language for phishing email detection. It is expected that the findings of this paper can be helpful for the scientific community, especially in the field of NLP application in cybersecurity problems.
This survey is also unique in the sense that it relates works to their openly available tools and resources. The analysis of the presented works revealed that not much work had been performed on Arabic language phishing emails using NLP techniques. Therefore, many open issues are associated with Arabic phishing email detection.
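The abstract above names TF-IDF as one of the most frequently used NLP techniques in phishing email detection. As a rough illustration of what such a weighting computes, here is a minimal standard-library sketch. It is not taken from the survey: the toy emails are invented, the helper names `tokenize` and `tf_idf` are hypothetical, and the smoothed IDF variant `log((1 + N) / (1 + df)) + 1` is an assumption (several TF-IDF variants exist).

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,!?:;").lower() for w in text.split())
    return [w for w in words if w]


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log((1 + N) / (1 + df)) + 1   (smoothed variant; an assumption)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy emails, invented for illustration only.
emails = [
    "Verify your account now",
    "Your meeting notes are attached",
    "Urgent: verify your password now",
]
weights = tf_idf(emails)
```

In this toy corpus, "your" occurs in every email and so receives the lowest possible IDF, while "account" occurs in only one email and is weighted more heavily there. Vectors of this kind are what a classifier such as an SVM would typically take as input.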
- Book Chapter
20
- 10.1007/978-3-319-30319-2_3
- Jan 1, 2016
Due to the growing volume of available textual information, there is a great demand for Natural Language Processing (NLP) techniques that can automatically process and manage texts, supporting information retrieval and communication in core areas of society (e.g. healthcare, business, and science). NLP techniques have to tackle the often ambiguous linguistic structures that people use in everyday speech. As such, there are many issues that have to be considered, for instance slang, grammatical errors, regional dialects, figurative language, etc. Figurative Language (FL), such as irony, sarcasm, simile, and metaphor, poses a serious challenge to NLP systems. FL is a frequent phenomenon within human communication, occurring both in spoken and written discourse including books, websites, fora, chats, social network posts, news articles and product reviews. Indeed, knowing what people think can help companies, political parties, and other public entities in strategizing and decision-making policies. When people are engaged in an informal conversation, they almost inevitably use irony (or sarcasm) to express something other than what is stated by the literal sentence meaning. Sentiment analysis methods can be easily misled by the presence of words that have a strong polarity but are used sarcastically, which means that the opposite polarity was intended. Several efforts have been recently devoted to detecting and tackling FL phenomena in social media. Many applications rely on task-specific lexicons (e.g. dictionaries, word classifications) or Machine Learning algorithms. Increasingly, numerous companies have begun to leverage automated methods for inferring consumer sentiment from online reviews and other sources. A system capable of interpreting FL would be extremely beneficial to a wide range of practical NLP applications.
In this sense, this chapter aims at evaluating how two specific domains of FL, sarcasm and irony, affect Sentiment Analysis (SA) tools. The study’s ultimate goal is to find out if FL hinders the performance (polarity detection) of SA systems due to the presence of ironic context. Our results indicate that computational intelligence approaches are more suitable in the presence of irony and sarcasm in Twitter classification.
- Research Article
2
- 10.2196/44191
- Jun 12, 2023
- JMIR AI
Aspirin-exacerbated respiratory disease (AERD) is an acquired inflammatory condition characterized by the presence of asthma, chronic rhinosinusitis with nasal polyposis, and respiratory hypersensitivity reactions on ingestion of aspirin or other nonsteroidal anti-inflammatory drugs (NSAIDs). Despite AERD having a classic constellation of symptoms, the diagnosis is often overlooked, with an average of greater than 10 years between the onset of symptoms and diagnosis of AERD. Without a diagnosis, individuals will lack opportunities to receive effective treatments, such as aspirin desensitization or biologic medications. Our aim was to develop a combined algorithm that integrates both natural language processing (NLP) and machine learning (ML) techniques to identify patients with AERD from an electronic health record (EHR). A rule-based decision tree algorithm incorporating NLP-based features was developed using clinical documents from the EHR at Mayo Clinic. From clinical notes, using NLP techniques, 7 features were extracted that included the following: AERD, asthma, NSAID allergy, nasal polyps, chronic sinusitis, elevated urine leukotriene E4 level, and documented no-NSAID allergy. MedTagger was used to extract these 7 features from the unstructured clinical text given a set of keywords and patterns based on the chart review of 2 allergy and immunology experts for AERD. The status of each extracted feature was quantified by assigning the frequency of its occurrence in clinical documents per subject. We optimized the decision tree classifier's hyperparameters cutoff threshold on the training set to determine the representative feature combination to discriminate AERD. We then evaluated the resulting model on the test set. 
The AERD algorithm, which combines NLP and ML techniques, achieved an area under the receiver operating characteristic curve score, sensitivity, and specificity of 0.86 (95% CI 0.78-0.94), 80.00 (95% CI 70.82-87.33), and 88.00 (95% CI 79.98-93.64) for the test set, respectively. We developed a promising AERD algorithm that needs further refinement to improve AERD diagnosis. Continued development of NLP and ML technologies has the potential to reduce diagnostic delays for AERD and improve the health of our patients.
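The abstract above reports sensitivity and specificity for a binary classifier. As a reminder of how these quantities are derived from predictions, here is a minimal sketch. The labels below are invented for illustration and are not the study's data; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, PPV and NPV for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that are flagged
        "specificity": tn / (tn + fp),  # share of non-cases that are cleared
        "ppv": tp / (tp + fp),          # share of flagged cases that are real
        "npv": tn / (tn + fn),          # share of cleared cases that are real
    }


# Hypothetical labels: 10 positive and 10 negative subjects.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 8 + [0] * 2 + [0] * 9 + [1] * 1
metrics = binary_metrics(y_true, y_pred)
```

This sketch assumes each denominator is non-zero, that is, both classes occur in the labels and in the predictions; a production version would need to handle the degenerate cases.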
- Research Article
2
- 10.1186/s12911-025-02851-w
- Jan 13, 2025
- BMC medical informatics and decision making
Anhedonia and suicidal ideation are symptoms of major depressive disorder (MDD) that are not regularly captured in structured scales but may be captured in unstructured clinical notes. Natural language processing (NLP) techniques may be used to extract longitudinal data on suicidal behaviors and anhedonia within unstructured clinical notes. This study assessed the accuracy of using NLP techniques on electronic health records (EHRs) to identify these symptoms among patients with MDD. EHR-derived, de-identified data were used from the NeuroBlu Database (version 23R1), a longitudinal behavioral health real-world database. Mental health clinicians annotated instances of anhedonia and suicidal symptoms in clinical notes creating a ground truth. Interrater reliability (IRR) was calculated using Krippendorff's alpha. A novel transformer architecture-based NLP model was trained on clinical notes to recognize linguistic patterns and contextual cues. Each sentence was categorized into one of four labels: (1) anhedonia; (2) suicidal ideation without intent or plan; (3) suicidal ideation with intent or plan; (4) absence of suicidal ideation or anhedonia. The model was assessed using positive predictive values (PPV), negative predictive values, sensitivity, specificity, F1-score, and AUROC. The model was trained, tested, and validated on 2,198, 1,247, and 1,016 distinct clinical notes, respectively. IRR was 0.80. For anhedonia, suicidal ideation with intent or plan, and suicidal ideation without intent or plan the model achieved a PPV of 0.98, 0.93, and 0.87, an F1-score of 0.98, 0.91, and 0.89 during training and a PPV of 0.99, 0.95, and 0.87 and F1-score of 0.99, 0.95, and 0.89 during validation. NLP techniques can leverage contextual information in EHRs to identify anhedonia and suicidal symptoms in patients with MDD. 
Integrating structured and unstructured data offers a comprehensive view of MDD's trajectory, helping healthcare providers deliver timely, effective interventions. Addressing current limitations will further enhance NLP models, enabling more accurate extraction of critical clinical features and supporting personalized, proactive mental health care.
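The abstract above reports interrater reliability as Krippendorff's alpha. The following is a minimal sketch of the nominal-data version of that statistic, computed from a coincidence matrix as alpha = 1 - D_o / D_e (observed over expected disagreement). It is not the study's code: the function name `krippendorff_alpha_nominal` is hypothetical, and the two-rater binary ratings are a small made-up example.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists; each inner list holds the labels that the
    available raters assigned to one unit (missing ratings are simply omitted).
    """
    # Coincidence matrix: every ordered pair of labels from different raters
    # within a unit contributes 1 / (m - 1), where m is the number of ratings.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit rated only once carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per label and overall number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more ratings")

    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1 - observed / expected


# Hypothetical ratings: two raters, ten units, binary labels.
rater_a = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
rater_b = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
alpha = krippendorff_alpha_nominal([list(pair) for pair in zip(rater_a, rater_b)])
```

For these made-up ratings the raters disagree on four of ten units, which is close to what chance alone would produce given the label frequencies, so alpha comes out near zero (about 0.095). Perfect agreement yields 1.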