Bringing data science to the speakers of every language

Abstract

Speakers of more than 5,000 languages have access to the internet and to communication technologies. The majority of phones, tablets, and computers now ship with language-enabled capabilities like speech recognition and intelligent auto-correction, and people increasingly interact with data-intensive, cloud-based language technologies like search engines and spam filters. For both personal and large-scale technologies, service quality drops or disappears entirely outside of a handful of languages. Speaking a low-resource language correlates with lower access to healthcare and education, and with higher vulnerability to disasters. Serving the broadest possible range of languages is crucial to ensuring equitable participation in the global information economy.

Similar Papers
  • Conference Article
  • 10.1109/icaicta.2015.7335344
Keynote speaker I: Development of Indonesian natural language processing tools and its usage in text applications
  • Aug 1, 2015
  • Ayu Purwarianti

The field of natural language processing has become increasingly interesting in recent years, driven both by the fast growth of the internet, where information can be automatically extracted from unstructured text in documents (text analytics), and by the growing human need to use computers as simply as possible (conversation). This applies to the Indonesian language as well, which is used by about 250 million people and understood in neighboring countries. Unlike major languages such as English or Japanese, the data resources for Indonesian language processing are very limited, and most were developed by researchers individually. Here, we describe our ongoing research on building Indonesian natural language processing tools, which we have named INANLP. The suite consists of several natural language processing tools, covering lexical, syntactic, and semantic processing. Given the limited data resources and expert knowledge, we employ both statistical and rule-based methods in building the tools. For tasks with adequate expert knowledge, we use only rule-based methods, as in tokenization, stemming, word formalization, and semantic analysis. For other tools, such as the POS tagger and named entity tagger, we employ statistical methods with additional knowledge to handle out-of-vocabulary (OOV) words. For the parser, we have so far applied only statistical methods, since the POS tagger used in the parser is already designed to handle OOV words. Within these limitations, we have used INANLP to build text applications covering text analytics, text understanding, and text conversation. In text analytics, we use INANLP to build text classification and information extraction; one example is a complaint management system that aims to automatically extract complaint information written by citizens on social media. Here, we use INANLP's tokenization, formalization, and named entity tagging modules to build the system. Another application example is in text understanding, where we generate mind maps from texts of simple sentences; the INANLP modules used here are tokenization, POS tagging, parsing, and semantic analysis. We have also built a question answering system that aims to find answers to a given question in unstructured text or structured data, employing tokenization, POS tagging, named entity tagging, and parsing.

  • Conference Instance
  • Citations: 9
  • 10.1145/2513549
Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing
  • Oct 28, 2013
  • Xiaozhong Liu + 3 more

It is our great pleasure to welcome you to the 2013 ACM International Workshop on Mining Unstructured Big Data using Natural Language Processing, held at the ACM International Conference on Information and Knowledge Management, CIKM 2013. Unstructured text data is heterogeneous and comes in many formats, such as text documents, scientific publications, web pages, and customer comments. The availability of many large unstructured text datasets enables, but also challenges, researchers to discover and explore valuable information and knowledge with different techniques. Mining semantics using natural language processing (NLP) methodologies is an important approach to uncovering the "latent knowledge/semantics" of unstructured text data. Over the past decade, while a number of NLP-based features have been used successfully to enhance the performance of text mining and information retrieval systems, challenges remain. For instance, most NLP algorithms have high computational cost, so they can hardly be applied directly to large-scale text data in online systems. This workshop brings together distinct but closely related research communities, namely NLP, text mining, and IR researchers, to investigate the opportunities and challenges in semantic mining. Nine very interesting papers, covering semantic analysis, social media mining, real-time information extraction, and more, will be presented at the workshop. The workshop offers both the NLP and text mining research communities an opportunity to clarify, from their research experience, the opportunities and challenges in NLP-based semantic mining of big unstructured text data. We also encourage attendees to attend the keynote presentation, "HathiTrust Data, Opportunities and Challenges for Text Mining and NLP," by Dr. Beth A. Plale, Director of the Data to Insight Center and Professor at the School of Informatics and Computing, Indiana University.
HathiTrust is a partnership of academic and research institutions, offering a collection of millions of digitized volumes from libraries around the world plus effective API access. We hope that you will find this program interesting and thought-provoking and that the workshop will provide you with a valuable opportunity to share ideas with other researchers and practitioners from institutions around the world.

  • Research Article
  • Citations: 2
  • 10.1177/13563890251330911
Text as data for evaluation: Natural language processing and large language models to generate novel insights from unstructured text data
  • Jun 13, 2025
  • Evaluation
  • Thomas Wencker + 2 more

Policy formulation and implementation generate large volumes of text. However, since reading all relevant sources is often impossible, evaluators must navigate the complexities of selecting the appropriate technology to efficiently extract meaningful information from growing amounts of unstructured text. Text mining blends interpretative and statistical methods to generate novel insights, potentially contributing to evidence-based policy-making. At the same time, biases, a potential lack of accuracy, explainability, and transparency create ethical concerns and make it necessary to combine natural language processing and human judgment to avoid over-reliance on the capabilities of these methods and, in particular, large language models. This article provides practical guidance on how evaluators can use natural language processing to convert unstructured data from text to structured data. It presents a decision framework that accounts for the characteristics of the data, the nature of the task, and the expected results, facilitating the selection of the appropriate technique.

  • Research Article
  • Citations: 13
  • 10.14569/ijacsa.2021.0120957
Personally Identifiable Information (PII) Detection in the Unstructured Large Text Corpus using Natural Language Processing and Unsupervised Learning Technique
  • Jan 1, 2021
  • International Journal of Advanced Computer Science and Applications
  • Poornima Kulkarni + 1 more

Personally Identifiable Information (PII) has gained much attention with the rapid development of technologies and the exploitation of information relating to individuals. Corporations and other organizations store large amounts of information, disseminated primarily in the form of emails that include personal information of users, employees, and customers. The security aspects of PII storage have been ignored, raising serious concerns about individual privacy. A significant concern is comprehending the responsibilities regarding the use of PII. However, in real-world scenarios, email data is unstructured text, and detecting PII in such a large unstructured text corpus is quite challenging. This paper presents an intelligent clustering approach for automatically detecting personally identifiable information (PII) in a large text corpus. The proposed study focuses on designing a model that receives text content and detects possible PII attributes. Therefore, this paper presents a clustering-based PII model (C-PPIM) based on NLP and unsupervised learning to detect PII in a large unstructured text corpus. NLP is used to perform topic modeling, and Byte mLSTM, a variant sequence model, is implemented to address the clustering problems in PII detection. The performance of the proposed model is analyzed against existing hierarchical clustering using silhouette and cohesion scores. The outcome indicated the effectiveness of the proposed system, which highlights significant PII attributes and has significant scope for real-time implementation, whereas existing techniques are too expensive to function and fit in real-time environments.

  • Conference Article
  • Citations: 1
  • 10.2523/iptc-24706-ms
Use of Natural Language Processing and Computer Vision in Deep Learning for Equipment Failure Investigation on Drilling Tools
  • Feb 17, 2025
  • Junko Hutahaean + 1 more

Incident investigation analysis within the oil and gas industry is a critical process to ensure operational safety, minimize downtime, and improve asset management. However, the sheer volume and heterogeneous nature of data sources (including structured and unstructured text and visual information) present significant challenges to traditional methods of incident classification and contextual understanding, which are labor-intensive and error-prone. This paper addresses these challenges by proposing an approach that harnesses natural language processing (NLP) and computer vision techniques in deep learning for equipment failure investigation analysis in drilling tools. The first component of our approach focuses on leveraging NLP for automated incident classification from a mixture of structured and unstructured text data within the oil and gas industry. With vast volumes of data generated from maintenance logs, technician reports, and incident summaries, manual incident classification becomes impractical and error-prone. By applying advanced NLP algorithms, including text mining and sentiment analysis, we automate the process of categorizing incidents, enabling real-time prioritization and deeper semantic analysis. The second component introduces a novel application of computer vision, where we employ deep learning-based techniques to detect and extract textual information from images captured on various electronic boards. By training models on annotated image datasets, our methodology facilitates the extraction of textual content from diverse electronic boards, enriching the incident investigation process with valuable insights. Our NLP methodology analyzes the textual content of diverse data sources and enables rapid identification, categorization, and prioritization of critical incidents.
By automating text detection from visual electronic board sources, the computer vision model built in this study enhances incident data collection, improves incident context understanding, facilitates efficient information extraction, and enables more accurate root cause analysis. Through empirical validation and case studies, we demonstrate the efficacy and novelty of our integrated approach. Our methodology streamlines incident investigation analysis by automating incident classification and text extraction from visual sources, providing deeper insights into incident contexts, and enabling more informed decision-making. This scalable and effective solution improves incident response, enhances operational safety, and preserves asset integrity within the oil and gas sector, offering a transformative approach to complex incident analysis challenges.

  • Research Article
  • Citations: 25
  • 10.1088/1742-6596/1018/1/012011
VisualUrText: A Text Analytics Tool for Unstructured Textual Data
  • May 1, 2018
  • Journal of Physics: Conference Series
  • Zuraini Zainol + 2 more

The growing amount of unstructured text on the Internet is tremendous. Text repositories come from Web 2.0, business intelligence, and social networking applications. It is also believed that 80-90% of future data growth will take the form of unstructured text databases that may potentially contain interesting patterns and trends. Text mining is a well-known technique for discovering interesting patterns and trends, i.e., non-trivial knowledge, in massive unstructured text data. Text mining covers multidisciplinary fields involving information retrieval (IR), text analysis, natural language processing (NLP), data mining, machine learning, statistics, and computational linguistics. This paper discusses the development of a text analytics tool that is proficient in extracting, processing, and analyzing unstructured text data and visualizing the cleaned text in multiple forms, such as a Document-Term Matrix (DTM), frequency graph, network analysis graph, word cloud, and dendrogram. This tool, VisualUrText, is developed to assist students and researchers in extracting interesting patterns and trends in document analyses.
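The document-term matrix mentioned above is easy to illustrate. A minimal sketch in plain Python (whitespace tokenization only; a real tool like VisualUrText would add cleaning, stemming, and visualization on top):

```python
from collections import Counter

def document_term_matrix(docs):
    """Build a document-term matrix: one row per document,
    one column per term in the shared, sorted vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted(set(t for doc in tokenized for t in doc))
    matrix = [[Counter(doc)[term] for term in vocab] for doc in tokenized]
    return vocab, matrix

docs = ["the cat sat", "the cat ate the fish"]
vocab, dtm = document_term_matrix(docs)
print(vocab)   # shared vocabulary, alphabetically sorted
print(dtm)     # term counts per document
```

Each row of the matrix is a bag-of-words vector; frequency graphs and word clouds are then simple aggregations over its columns.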

  • Conference Article
  • Citations: 4
  • 10.1145/3508230.3508250
CBCP: A Method of Causality Extraction from Unstructured Financial Text
  • Dec 17, 2021
  • Lang Cao + 2 more

Extracting causality information from unstructured natural language text is a challenging problem in natural language processing, and there are no mature, dedicated causality extraction systems. Most work uses basic sequence labeling methods, such as the BERT-CRF model, to extract causal elements from unstructured text, and the results are usually not good. At the same time, there are a large number of causal event relations in the field of finance. If we can extract financial causality at scale, this information will help us better understand the relationships between financial events and build related event evolutionary graphs in the future. In this paper, we propose a causality extraction method for this problem, named CBCP (Center word-based BERT-CRF with Pattern extraction), which can directly extract cause elements and effect elements from unstructured text. Compared to the BERT-CRF model, our model incorporates the information of center words as prior conditions and performs better at entity extraction. Moreover, combining our method with patterns further improves causality extraction. We then evaluate our method against the basic sequence labeling method and show that it outperforms other basic extraction methods on causality extraction tasks in the finance field. Finally, we summarize our work and outline directions for future work.
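The pattern component of such a pipeline can be illustrated with a toy connective-based extractor. This is a hypothetical sketch with two made-up patterns, not the CBCP patterns from the paper:

```python
import re

# Two illustrative causal patterns (hypothetical examples):
# "<effect> because <cause>" and "Due to <cause>, <effect>".
PATTERNS = [
    re.compile(r"^(?P<effect>.+?)\s+because\s+(?P<cause>.+?)\.?$", re.IGNORECASE),
    re.compile(r"^due to\s+(?P<cause>.+?),\s*(?P<effect>.+?)\.?$", re.IGNORECASE),
]

def extract_causality(sentence):
    """Return (cause, effect) if a causal pattern matches, else None."""
    for pattern in PATTERNS:
        m = pattern.match(sentence.strip())
        if m:
            return m.group("cause"), m.group("effect")
    return None

print(extract_causality("The stock fell because the company missed earnings."))
```

In CBCP-style systems, such surface patterns complement the learned sequence labeler rather than replace it, since most causal language is far less regular than these two templates.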

  • Research Article
  • Citations: 12
  • 10.1016/j.ajpc.2021.100300
Leveraging structured and unstructured electronic health record data to detect reasons for suboptimal statin therapy use in patients with atherosclerotic cardiovascular disease
  • Dec 3, 2021
  • American Journal of Preventive Cardiology
  • Glenn T Gobbel + 7 more


  • Research Article
  • 10.30574/gscarr.2025.25.3.0392
Predictive Analytics for Supply Chain Resilience: Developing AI Systems to Predict Disruptions
  • Dec 31, 2025
  • GSC Advanced Research and Reviews
  • Praveen Kumar + 1 more

In an era of heightened global interconnectedness and escalating volatility, supply chain resilience has become a strategic imperative for economic stability and national security. Modern supply chains are increasingly exposed to disruptions arising from natural disasters, geopolitical tensions, economic shocks, and global health crises, as starkly demonstrated by the COVID-19 pandemic. These vulnerabilities underscore the urgent need for advanced, predictive mechanisms capable of anticipating disruptions and enabling proactive mitigation. This paper investigates the design and application of artificial intelligence (AI)–driven predictive analytics systems to enhance supply chain resilience. It examines how machine learning, real-time data analytics, and natural language processing (NLP) can be integrated to forecast potential disruptions and support timely decision-making. Building on a comprehensive review of existing literature, the study highlights the evolving role of predictive analytics in supply chain risk management and identifies gaps in current approaches. The proposed framework leverages diverse data sources, including historical supply chain records, weather data, economic indicators, geopolitical developments, and unstructured textual data from news and social media. Multiple machine learning techniques—such as time-series forecasting, classification models, anomaly detection, and NLP—are employed to detect early warning signals and assess disruption severity. Model performance is evaluated using standard metrics to ensure robustness and reliability. Two case studies, focusing on natural disasters and geopolitical risks, demonstrate the practical value of the framework. The findings confirm that AI-driven predictive analytics can significantly improve supply chain resilience by enabling early intervention, risk mitigation, and continuity of operations, while also highlighting implementation challenges and future research directions.
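One of the techniques listed above, anomaly detection, can be sketched with a simple trailing-window z-score detector (illustrative parameters and data; not the paper's system):

```python
import statistics

def zscore_anomalies(values, window=5, threshold=3.0):
    """Flag indices whose value deviates from the trailing-window
    mean by more than `threshold` sample standard deviations."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev > 0 and abs(values[i] - mean) / stdev > threshold:
            flagged.append(i)
    return flagged

# Stable shipment volumes with one sudden disruption at index 8
volumes = [100, 102, 98, 101, 99, 100, 103, 97, 40, 101]
print(zscore_anomalies(volumes))
```

A production system would combine such univariate signals with the classification and NLP outputs the paper describes to score disruption severity.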

  • Research Article
  • Citations: 28
  • 10.1007/s00521-024-09532-1
Application of BiLSTM-CRF model with different embeddings for product name extraction in unstructured Turkish text
  • Feb 21, 2024
  • Neural Computing and Applications
  • Serdar Arslan

Named entity recognition (NER) plays a pivotal role in Natural Language Processing by identifying and classifying entities within textual data. While NER methodologies have seen significant advancements, driven by pretrained word embeddings and deep neural networks, the majority of these studies have focused on text with well-defined grammar and structure. A significant research gap exists concerning NER in informal or unstructured text, where traditional grammar rules and sentence structure are absent. This research addresses this crucial gap by focusing on the detection of product names within unstructured Turkish text. To accomplish this, we propose a deep learning-based NER model which combines a Bidirectional Long Short-Term Memory (BiLSTM) architecture with a Conditional Random Field (CRF) layer, further enhanced by FastText embeddings. To comprehensively evaluate and compare our model’s performance, we explore different embedding approaches, including Word2Vec and Glove, in conjunction with the Bidirectional Long Short-Term Memory and Conditional Random Field (BiLSTM-CRF) model. Furthermore, we conduct comparisons against BERT to assess the efficacy of our approach. Our experimentation utilizes a Turkish e-commerce dataset gathered from the internet, where traditional grammatical and structural rules may not apply. The BiLSTM-CRF model with FastText embeddings achieved an F1 score value of 57.40%, a precision value of 55.78%, and a recall value of 59.12%. These results indicate promising performance in outperforming other baseline techniques. This research contributes to the field of NER by addressing the unique challenges posed by unstructured Turkish text and opens avenues for improved entity recognition in informal language settings, with potential applications across various domains.
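At inference time, a CRF layer on top of a BiLSTM picks the highest-scoring tag sequence with Viterbi decoding over the emission scores and learned transition scores. A minimal pure-Python sketch with made-up scores and a toy O/B-PRODUCT/I-PRODUCT tag set:

```python
def viterbi(emissions, transitions):
    """Decode the highest-scoring tag sequence.
    emissions[t][j]: score of tag j at position t (from the BiLSTM).
    transitions[i][j]: score of moving from tag i to tag j (CRF)."""
    n_tags = len(emissions[0])
    scores = list(emissions[0])   # best score of any path ending in each tag
    backptr = []
    for em in emissions[1:]:
        new_scores, ptrs = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + em[j])
            ptrs.append(best_i)
        scores, backptr = new_scores, backptr + [ptrs]
    # Follow back-pointers from the best final tag
    best = max(range(n_tags), key=lambda j: scores[j])
    path = [best]
    for ptrs in reversed(backptr):
        best = ptrs[best]
        path.append(best)
    return list(reversed(path))

# Tags: 0 = O, 1 = B-PRODUCT, 2 = I-PRODUCT (illustrative scores only)
transitions = [[0.0, 0.0, -5.0],   # O -> I-PRODUCT is penalized
               [0.0, -5.0, 1.0],   # B -> I is encouraged
               [0.0, -5.0, 1.0]]
emissions = [[2.0, 1.0, 0.0],
             [0.0, 3.0, 1.0],
             [0.0, 0.0, 2.5]]
print(viterbi(emissions, transitions))
```

The transition scores are what let the CRF forbid illegal tag sequences (such as I-PRODUCT directly after O), which a softmax over per-token BiLSTM outputs cannot do.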

  • Research Article
  • Citations: 37
  • 10.1136/amiajnl-2014-003009
Automatic abstraction of imaging observations with their characteristics from mammography reports.
  • Oct 28, 2014
  • Journal of the American Medical Informatics Association : JAMIA
  • Selen Bozkurt + 3 more


  • Research Article
  • Citations: 1
  • 10.1042/etls20190003
New advances in extracting and learning from protein-protein interactions within unstructured biomedical text data.
  • Aug 6, 2019
  • Emerging topics in life sciences
  • J Harry Caufield + 1 more

Protein-protein interactions, or PPIs, constitute a basic unit of our understanding of protein function. Though substantial effort has been made to organize PPI knowledge into structured databases, maintenance of these resources requires careful manual curation. Even then, many PPIs remain uncurated within unstructured text data. Extracting PPIs from experimental research supports assembly of PPI networks and highlights relationships crucial to elucidating protein functions. Isolating specific protein-protein relationships from numerous documents is technically demanding by both manual and automated means. Recent advances in the design of these methods have leveraged emerging computational developments and have demonstrated impressive results on test datasets. In this review, we discuss recent developments in PPI extraction from unstructured biomedical text. We explore the historical context of these developments, recent strategies for integrating and comparing PPI data, and their application to advancing the understanding of protein function. Finally, we describe the challenges facing the application of PPI mining to the text concerning protein families, using the multifunctional 14-3-3 protein family as an example.

  • Research Article
  • Citations: 15
  • 10.2196/12575
Extracting Clinical Features From Dictated Ambulatory Consult Notes Using a Commercially Available Natural Language Processing Tool: Pilot, Retrospective, Cross-Sectional Validation Study
  • Nov 1, 2019
  • JMIR Medical Informatics
  • Jeremy Petch + 3 more

Background: The increasing adoption of electronic health records (EHRs) in clinical practice holds the promise of improving care and advancing research by serving as a rich source of data, but most EHRs allow clinicians to enter data in a text format without much structure. Natural language processing (NLP) may reduce reliance on manual abstraction of these text data by extracting clinical features directly from unstructured clinical digital text data and converting them into structured data. Objective: This study aimed to assess the performance of a commercially available NLP tool for extracting clinical features from free-text consult notes. Methods: We conducted a pilot, retrospective, cross-sectional study of the accuracy of NLP from dictated consult notes from our tuberculosis clinic with manual chart abstraction as the reference standard. Consult notes for 130 patients were extracted and processed using NLP. We extracted 15 clinical features from these consult notes and grouped them a priori into categories of simple, moderate, and complex for analysis. Results: For the primary outcome of overall accuracy, NLP performed best for features classified as simple, achieving an overall accuracy of 96% (95% CI 94.3-97.6). Performance was slightly lower for features of moderate clinical and linguistic complexity at 93% (95% CI 91.1-94.4), and lowest for complex features at 91% (95% CI 87.3-93.1). Conclusions: The findings of this study support the use of NLP for extracting clinical features from dictated consult notes in the setting of a tuberculosis clinic. Further research is needed to fully establish the validity of NLP for this and other purposes.
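Confidence intervals on an accuracy proportion, like those reported in the results, can be computed with the standard Wilson score interval. A sketch with illustrative counts (the abstract does not state which interval method was used):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a proportion
    (z = 1.96 gives an approximate 95% interval)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# e.g. 96 correct extractions out of 100 checked (hypothetical counts)
low, high = wilson_interval(96, 100)
print(f"accuracy 96%, 95% CI {low:.3f}-{high:.3f}")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays within [0, 1] and behaves sensibly for proportions near 1, which matters for accuracies in the 90%+ range.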

  • Research Article
  • Citations: 64
  • 10.1016/j.ijrobp.2021.01.044
Clinical Natural Language Processing for Radiation Oncology: A Review and Practical Primer
  • Feb 3, 2021
  • International journal of radiation oncology, biology, physics
  • Danielle S Bitterman + 3 more


  • Research Article
  • Citations: 1
  • 10.1007/s41060-025-00750-x
Real-time monitoring of streaming text data by integrating text visualization techniques and natural language processing
  • Mar 31, 2025
  • International Journal of Data Science and Analytics
  • Grigorios Papageorgiou + 2 more

Real-time monitoring of streaming data is crucial, especially when dealing with unstructured text data, which is increasingly prevalent in our daily activities. This type of data requires special attention in various industries. For instance, in the travel and hospitality sector, businesses monitor customer reviews, travel forums, and social media to assess service quality, satisfaction, and emerging trends. In public safety, national security, and biosurveillance, monitoring online forums, social media, and news outlets is essential for identifying threats, criminal activity, or public safety concerns. Additionally, the challenge of efficiently monitoring unstructured text streams, such as business emails, is a significant issue for large organizations. In this paper, we propose a method that combines natural language processing and text visualization techniques with traditional process monitoring algorithms to enhance the analysis and understanding of text streams. Our method involves mapping text streams onto a bivariate plot, followed by the application of monitoring techniques on a sequence of statistics derived from these plots. This sequential analysis reveals valuable insights into temporal patterns and fluctuations. We integrate into the proposed method two robust real-time monitoring procedures, Cumulative Sum and Pruned Exact Linear Time, both widely used for change point detection, allowing the framework to also support retrospective analysis for extracting critical insights. The effectiveness of this framework is demonstrated through simulations and two real-world case studies. The first case study involves the BBC, where the framework is used to detect extreme events through published articles. The second, provided by a shipbroker in Greece, applies the framework to monitor a large volume of emails.
Ultimately, the proposed solution provides a comprehensive approach to dynamically monitoring time-varying unstructured text streams, offering organizations a powerful tool for informed decision-making and improved market intelligence.
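The Cumulative Sum (CUSUM) procedure the authors integrate can be sketched in a few lines: accumulate deviations of the monitored statistic from a target value, less a slack term, and signal when either one-sided sum exceeds a threshold. A minimal sketch with illustrative parameters:

```python
def cusum(values, target, k=0.5, h=4.0):
    """Two-sided CUSUM: return the first index at which the cumulative
    deviation from `target` (less slack k) exceeds threshold h,
    or None if no change is detected."""
    hi = lo = 0.0
    for i, x in enumerate(values):
        hi = max(0.0, hi + (x - target) - k)   # upward shifts
        lo = max(0.0, lo + (target - x) - k)   # downward shifts
        if hi > h or lo > h:
            return i
    return None

# A monitored statistic that drifts upward starting at index 5
stream = [0.1, -0.2, 0.0, 0.3, -0.1, 2.0, 2.2, 1.9, 2.1, 2.3]
print(cusum(stream, target=0.0))
```

The slack k trades sensitivity for false alarms: small persistent shifts accumulate past h, while isolated noise decays back toward zero.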
