Articles published on Textual Data
13483 Search results
- New
- Research Article
- 10.1080/02664763.2025.2540380
- Apr 4, 2026
- Journal of Applied Statistics
- Youngsun Kim + 2 more
Topic modeling is a process that discovers key themes in unstructured text data by identifying the distribution of topics and words in a document, revealing hidden dimensions. Latent Dirichlet allocation is a widely used generative probabilistic topic model, but it cannot capture the dependency between topics. Generally, the topics within a document are primarily influenced by its overarching theme, which naturally interrelates the topics. Thus, it is imperative to unveil such relationships between the topics. To this end, this study proposes a multilevel topic model (MTM) to unearth the hidden topic dependency in a corpus through a multilevel latent structure. The MTM allows word-based topic proportions to vary across the higher-level latent structure. The parameters are estimated with a modified EM algorithm using an upward-downward approach to alleviate the computational complexity. Empirical studies of the multilevel topic model have also been conducted on real corpora, and the hierarchy it recovers has been interpreted. These analyses have demonstrated that the proposed multilevel topic model outperforms latent Dirichlet allocation in terms of systematic interpretability.
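The baseline the MTM extends is standard latent Dirichlet allocation. Its collapsed Gibbs sampler can be sketched in a few dozen lines (a minimal pure-Python illustration of vanilla LDA, not the authors' MTM or their modified EM algorithm; the toy corpus is hypothetical):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for vanilla LDA."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    # z[d][i]: topic assigned to the i-th word of document d
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional p(z = t | everything else)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # smoothed, normalized document-topic proportions
    theta = [[(c + alpha) / (sum(row) + n_topics * alpha) for c in row] for row in ndk]
    return theta, nkw

docs = [["cell", "gene", "protein", "gene"],
        ["market", "stock", "price", "stock"],
        ["gene", "protein", "cell"],
        ["price", "market", "stock"]]
theta, topic_words = lda_gibbs(docs)
```

The MTM's contribution is precisely what this sketch lacks: the `theta` rows here are drawn independently per document, with no higher-level latent structure tying topics together.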
- New
- Research Article
- 10.1016/j.neunet.2025.108218
- Apr 1, 2026
- Neural networks : the official journal of the International Neural Network Society
- Yanglei Gan + 7 more
Optimizing boundary dynamics for nested named entity recognition via semantic refinement and trimming.
- New
- Research Article
- 10.1016/j.jcrc.2025.155358
- Apr 1, 2026
- Journal of critical care
- Michael Pratte + 5 more
Can large language models approximate the results of meta-analyses in critical care? A meta-research study.
- New
- Research Article
- 10.1016/j.compbiomed.2026.111599
- Apr 1, 2026
- Computers in biology and medicine
- Clodomir Santana + 19 more
Natural language processing of biomedical text to map and prioritize protein-disease associations in HFpEF.
- New
- Research Article
- 10.1016/j.neunet.2025.108385
- Apr 1, 2026
- Neural networks : the official journal of the International Neural Network Society
- Shuxiang Hou + 5 more
AGNER: Agile governance-oriented unified named entity recognition for continual learning with diffusion adaptation.
- Research Article
- 10.9781/ijimai.2026.6499
- Mar 13, 2026
- International Journal of Interactive Multimedia and Artificial Intelligence
- Yixuan Wang + 3 more
Cybersecurity ontology development is typically carried out by cybersecurity experts and ontology engineers. While some existing works focus on extracting cybersecurity knowledge from either textual or structured data, few address the challenge of handling both types of data simultaneously. This paper presents Locust, a tool integrating structured data and a domain corpus for comprehensive cybersecurity ontology generation. We use open-source cybersecurity specifications as structured input to build the skeleton of the ontology, and use the domain corpus to enrich and finalise the ontology. Additionally, we propose a methodology for filtering and simplifying the ontology using hierarchical clustering and multi-way trees. Experimental results demonstrate the effectiveness of our approach in acquiring a cybersecurity ontology from specific domain data sources. Locust is implemented in Java and is available as an open-source tool.
- Research Article
- 10.1038/s41598-026-44046-x
- Mar 13, 2026
- Scientific reports
- Qianan Ai + 1 more
Faced with the growing contradiction between elderly travel demand and the traffic systems of most rural areas, road infrastructure enhancement, bus service improvement, and traffic safety management should all be given full consideration. An investigation of current rural transportation infrastructure is first performed in the studied area, covering the road network configuration, traffic facilities, surface pavement, and bus services. Meanwhile, to better grasp trip behavior and characteristics, in-depth analyses of elderly travel demands and experiences are performed based on field observation and public data, where the K-means clustering method is applied to identify different trip groups, and natural language processing is adopted to extract specialized needs from public textual data. Based on the foregoing investigation and analysis, a hierarchical improvement framework for elderly-oriented rural traffic is then proposed, covering network planning, transportation management, and facility configuration, where quantification models of evaluation indicators are established considering transit network topology and spatial demand distribution. Through a combined qualitative and quantitative evaluation, the recommended strategies are expected to greatly enhance global network accessibility by upgrading the road network and reconstructing the bus network, and to improve trip safety and convenience by optimizing maintenance works and bus services. Specifically, under a three-layer rural bus network architecture, the enhancement rates of service coverage and average accessible distance are expected to be 26.2% and 54.6%, respectively, at the expense of a 30.8% increase in daily operation cost.
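The trip-group identification step above relies on K-means clustering. A generic sketch of that algorithm (Lloyd's iterations, pure Python; the feature pairs are hypothetical stand-ins for trip attributes, not the authors' data or pipeline):

```python
import math
import random

def kmeans(points, k=2, iters=50, seed=1):
    """Plain K-means (Lloyd's algorithm) over tuples of floats."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        # update step: each center moves to its cluster mean
        centers = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers, clusters

# hypothetical (trip distance in km, trips per week) samples
trips = [(1.0, 5), (1.2, 6), (0.8, 4), (8.0, 1), (9.5, 2), (7.5, 1)]
centers, clusters = kmeans(trips, k=2)
```

In practice one would standardize the features and choose `k` via a criterion such as silhouette score rather than fixing it in advance.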
- Research Article
- 10.3390/geohazards7010035
- Mar 11, 2026
- GeoHazards
- Adnan Ahmed Abi Sen + 5 more
Smart cities require effective disaster management (for events such as flooding, solar storms, sandstorms, or hurricanes), as it directly impacts people’s lives. The key challenges of disaster management are timely detection and effective notification during a crisis. This research presents a smart multi-layer framework for notification classification and management before and during flooding disasters. The framework includes an early detection module as the main phase in the alerting process. This step depends on an Ensemble Learning (EL) model combining the three best-performing selected models (Deep Learning (DL), Random Forest (RF), and K-Nearest Neighbor (KNN)) to analyze data collected continuously from the Internet of Things (IoT) layer. In the boosting phase, the framework utilizes Large Language Models (LLMs) with DL to analyze textual social crowdsourcing data, enabling it to identify the most affected areas during a flood. The framework adds a fog computing layer alongside a cloud layer to enable instantaneous processing of user responses and to generate specialized alerts based on contextual factors such as location, time, risk level, alert type, and user characteristics. Through testing and implementation, the proposed algorithms demonstrated an accuracy rate of over 98% in detecting threats using a dataset of real, collected weather and flooding data. Additionally, the framework proposes a centralized control panel and the design of a smartphone application that offers essential services and facilitates communication among civil defense teams, citizens, and volunteers.
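The simplest way to combine a triad of classifiers like the DL/RF/KNN ensemble described above is per-sample majority voting (a simplified sketch of the idea, not the paper's trained models; the labels below are hypothetical):

```python
from collections import Counter

def majority_vote(*prediction_lists):
    """Combine per-sample labels from several models by majority vote."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*prediction_lists)]

# hypothetical flood-risk labels from three models for five sensor readings
dl  = ["flood", "safe", "flood", "safe", "flood"]
rf  = ["flood", "safe", "safe",  "safe", "flood"]
knn = ["safe",  "safe", "flood", "flood", "flood"]
consensus = majority_vote(dl, rf, knn)  # one label per reading
```

With an odd number of voters and two classes, ties cannot occur; real ensembles typically also weight votes by each model's validation performance.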
- Research Article
- 10.1177/19322968261422628
- Mar 10, 2026
- Journal of diabetes science and technology
- Rongping Zha + 8 more
Diabetes is a chronic condition requiring long-term management, and continuous health education is vital for improving disease awareness and self-management. Large language models (LLMs), advanced artificial intelligence systems trained on large text data sets, have shown promise in generating diabetes-related educational materials. While LLMs can generate accurate and readable content, most studies focus on general, guideline-based education rather than tailoring content to individual patients' clinical profiles. This study addresses these gaps by comparing the performance of three major LLMs (ChatGPT-4o, Doubao 1.5, and DeepSeek R1) in generating health education materials for discharged patients with diabetes. Ten de-identified medical records of discharged patients with diabetes were uploaded to the LLMs. Each model generated health education materials based on these records. Experienced diabetes nursing experts evaluated the quality of the generated materials. Pass rates for comprehensibility were above 70% for all models, with DeepSeek R1 performing best (P < .01). Pass rates for actionability were below 70% for all models, with no significant differences (P > .01). Accuracy scores for all models were ≥98%, with no significant differences (P > .01). Similarly, no significant differences were observed in personalization and effectiveness scores (P > .01). DeepSeek R1 achieved the highest safety score, while Doubao 1.5 had the lowest (P < .01). While ChatGPT-4o, Doubao 1.5, and DeepSeek R1 generate accurate and comprehensible materials, concerns remain about their actionability and safety. These findings suggest that LLMs should be used as auxiliary tools in diabetes education, requiring further refinement for personalized and actionable content.
- Research Article
- 10.1007/s41060-026-01066-0
- Mar 9, 2026
- International Journal of Data Science and Analytics
- Lamukanyani Alson Mantshimuli + 1 more
Most portfolio optimization frameworks assume static objectives and constraints, making them fragile under regime shifts, transaction frictions, and evolving information. Existing LLM-based methods focus on signal generation without governing execution or constraint elasticity, while reinforcement learning approaches often lack transparency and cost discipline. This lack of a unified, interpretable architecture hinders adaptability and accountability during live rebalancing. We address this gap by introducing an agentic portfolio optimization framework that integrates regime-aware convex optimization, LLM-derived sentiment and uncertainty features, and a constrained reinforcement learning controller in a closed loop. The agent senses market and news data, infers regimes, and dynamically adjusts objectives, risk budgets, and position limits, while enforcing friction-aware execution through Sharpe-gated trade activation, partial rebalancing, and turnover budgeting. In a 50-stock S&P 500 portfolio tested under walk-forward evaluation (2021–2025 Q1), agentic portfolios consistently outperform non-agentic benchmarks, with Sharpe ratio gains of up to +0.373 (NSGA-3), persisting net of transaction costs and alongside lower turnover. These results highlight the value of combining forward-looking signals, regime-conditioned intent, and disciplined execution within a unified, transparent, adaptive agentic architecture. Future research should explore multi-asset extensions, richer textual and alternative data for regime inference, explainability metrics for LLM-driven signals, and hybrid architectures blending deterministic control with selective policy learning for extreme market conditions.
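The "Sharpe-gated trade activation with partial rebalancing" idea above can be sketched concretely (a toy illustration of the gating concept, not the paper's controller; the threshold, half-step fraction, and return series are hypothetical):

```python
import statistics

def sharpe(returns, rf=0.0):
    """Simple (unannualized) Sharpe ratio over a return series."""
    excess = [r - rf for r in returns]
    sd = statistics.stdev(excess)
    return statistics.mean(excess) / sd if sd else 0.0

def gated_rebalance(current_w, target_w, recent_returns, threshold=0.5):
    """Only trade toward the target weights when the strategy's recent
    Sharpe ratio clears a threshold; otherwise hold (gate closed).
    Partial rebalancing moves halfway toward the target to limit turnover."""
    if sharpe(recent_returns) < threshold:
        return list(current_w)
    return [c + 0.5 * (t - c) for c, t in zip(current_w, target_w)]

good_run = [0.01, 0.02, -0.005, 0.015, 0.01]   # Sharpe > threshold
bad_run = [-0.01, 0.0, -0.02, 0.01, 0.0]       # Sharpe < threshold
w_open = gated_rebalance([0.6, 0.4], [0.2, 0.8], good_run)
w_closed = gated_rebalance([0.6, 0.4], [0.2, 0.8], bad_run)
```

The gate suppresses turnover when recent risk-adjusted performance does not justify trading costs, which is the friction-aware discipline the abstract emphasizes.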
- Research Article
- 10.3390/app16052570
- Mar 7, 2026
- Applied Sciences
- Sakire Nesli Demircioglu + 2 more
Analyzing customer feedback is critical for identifying unmet needs in product development and innovation processes. However, current studies often focus only on identifying customer-expressed problems, neglecting to systematically match these problems with technological solutions and transform them into potential product features. This study aims to propose a sentiment and semantic analysis-based approach that correlates problems derived from customer feedback with patent-based solutions. The proposed approach utilizes Aspect-Based Sentiment Analysis to identify unmet needs from customer feedback, the BERTopic algorithm to extract solution-oriented themes from patent documents, and short text semantic similarity methods to associate problem-solution pairs. The applicability of the approach is demonstrated using 476 customer product reviews and 3548 patents in the Heating, Ventilation, and Air Conditioning (HVAC) field. The results show that customer-expressed problems can be semantically correlated with patent-based technological solutions, and these matches contribute to the identification of potential product features. The resulting problem-solution matches are structured along technological development horizons and presented as a technology roadmap output. The proposed approach offers a framework supporting systematic problem–solution matching based on sentiment and semantic analysis in technology-intensive sectors with large volumes of unstructured text data.
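The problem-solution matching step above scores semantic similarity between short texts. A bag-of-words cosine similarity conveys the mechanics (a deliberately simple stand-in for the embedding-based short-text similarity the authors use; the review and patent snippets are hypothetical):

```python
import math
import re
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector as a Counter."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# hypothetical HVAC review problem vs. candidate patent-theme summaries
problem = "the unit makes loud noise and vibration at night"
patents = ["noise damping mount reduces compressor vibration",
           "smart thermostat schedules temperature setbacks"]
scores = [cosine(bow(problem), bow(p)) for p in patents]
best_match = patents[scores.index(max(scores))]
```

Word-overlap cosine misses paraphrases ("noisy" vs. "noise"), which is exactly why the study pairs it with dense semantic representations rather than raw token counts.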
- Research Article
- 10.1080/00036846.2026.2637817
- Mar 5, 2026
- Applied Economics
- Liying Zhuang + 3 more
As a critical link between policy orientation and market demand, government innovation procurement has increasingly demonstrated its role in shaping firms’ competitive strategies. Based on over two million government procurement contract records and textual data from corporate annual reports, this study constructs indicators for government innovation procurement and corporate competitive strategies using machine learning and text analysis methods. We empirically examine the impact of government innovation procurement on firms’ competitive strategies. The results show that government innovation procurement facilitates the implementation of firms’ competitive strategies. Mechanism analysis reveals that such procurement enhances firms’ resource acquisition capabilities to promote differentiation strategies; it also drives digital transformation to support cost leadership strategies. Heterogeneity analysis further indicates that the impact of government innovation procurement on competitive strategies varies according to firm characteristics, industry attributes, and types of procurement. This study unveils the interactive logic between government innovation procurement and corporate competitive strategy, offering valuable insights for emerging economies seeking to improve their public procurement systems.
- Research Article
- 10.15620/cdc/174640
- Mar 5, 2026
- National vital statistics reports : from the Centers for Disease Control and Prevention, National Center for Health Statistics, National Vital Statistics System
- Matthew Garnett + 2 more
This report identifies the specific drugs most frequently involved in drug overdose deaths in the United States from 2017 through 2023. Data from the 2017-2023 National Vital Statistics System mortality files were linked to literal text data from death certificates. Drug overdose deaths were identified using the International Classification of Diseases, 10th Revision underlying cause-of-death codes X40-X44, X60-X64, X85, and Y10-Y14. Specific drugs were identified by searching three literal text fields of the death certificate: the causes of death from Part I, significant conditions contributing to death from Part II, and the description of how the injury occurred. Contextual information was used to determine drug involvement in the death. Descriptive statistics were calculated for the most frequently mentioned drugs involved in drug overdose deaths. Deaths involving multiple drugs were counted in all relevant drug categories. Among drug overdose deaths with mention of at least one specific drug, the most frequently mentioned drugs during 2017-2023 included: fentanyl, heroin, oxycodone, morphine, methadone, hydrocodone, alprazolam, diphenhydramine, cocaine, methamphetamine, amphetamine, gabapentin, and xylazine. Fentanyl ranked first across all years and was the most common concomitant drug found with other top drugs, ranging from 99.0% of xylazine-involved drug overdose deaths to 48.3% of oxycodone-involved drug overdose deaths. Cocaine and methamphetamine were also frequent concomitant drugs. Trends in age-adjusted rates across the 2017 to 2023 period varied by drug, but notably the rate for heroin-involved deaths sharply declined, while the rate for fentanyl-involved deaths increased and then stabilized between 2022 and 2023. 
In 2023, the most frequently mentioned drugs in unintentional drug overdose deaths were fentanyl, methamphetamine, and cocaine, while the most frequently mentioned drugs for suicide-related drug overdoses were diphenhydramine, oxycodone, and bupropion. This report identifies patterns in the specific drugs most frequently involved in drug overdose deaths from 2017 through 2023.
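The core mechanic of the report's method, searching the three literal-text fields of a death certificate for drug mentions, can be sketched as follows (an illustrative fragment, not NCHS's actual system: the drug lexicon is tiny, the field names are hypothetical, and the real method also applies contextual rules to confirm involvement):

```python
import re

# illustrative subset; the report's lexicon is far larger
DRUGS = ["fentanyl", "heroin", "oxycodone", "cocaine", "methamphetamine"]

def drugs_mentioned(record):
    """Search the three literal-text fields of a (hypothetical)
    death-certificate record for whole-word drug mentions."""
    text = " ".join(record.get(field, "") for field in
                    ("cause_part1", "conditions_part2", "injury_description"))
    text = text.lower()
    return sorted({d for d in DRUGS if re.search(r"\b" + d + r"\b", text)})

record = {
    "cause_part1": "Acute fentanyl and cocaine toxicity",
    "conditions_part2": "Chronic substance use",
    "injury_description": "Decedent used illicit drugs",
}
found = drugs_mentioned(record)  # ['cocaine', 'fentanyl']
```

Because a single death can mention several drugs, each match is counted in every relevant drug category, matching the report's multi-drug counting rule.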
- Research Article
- 10.38094/jastt71658
- Mar 3, 2026
- Journal of Applied Science and Technology Trends
- Manas Ranjan Biswal + 1 more
Automatic anomaly detection in video surveillance is crucial for public and private safety. However, it is challenging because of unclear abnormal events, limited labeled data, and mismatches between different types of data. Traditional video anomaly detection methods mainly focus on spatiotemporal visual features and often ignore semantic information and interactions between different data types. Additionally, many multimodal approaches use basic fusion methods that do not solve the alignment problems between these types of data. To address these issues, we propose a multimodal framework that includes a Hierarchical Multi-scale Temporal Network (H-MSTN). This network models short-, medium-, and long-term dependencies in visual and textual data. A lightweight cross-modal attention module ensures semantic alignment, while a Multimodal Attention-Based Fusion Transformer (MAFT) refines cross-modal representations in real time. We evaluate this framework using the UCF-Crime and XD-Violence benchmarks. The proposed method achieves 92.42% AUC on UCF-Crime and 88.63% AP on XD-Violence with significantly lower computational cost and faster inference than recent multimodal baselines such as ReFLIP-VAD. These results demonstrate a strong efficiency–accuracy trade-off for real-time deployment while maintaining competitive or improved performance over prior methods such as MVAD and TEVAD.
- Research Article
- 10.7717/peerj-cs.3629
- Mar 3, 2026
- PeerJ Computer Science
- Tahani Jaser Alahmadi + 5 more
Emotion recognition plays an important role in a wide range of application domains. Although previous studies have made progress in this domain, they often fall short in achieving a better understanding of emotions and inferring their underlying causes. To address these limitations, we propose an emotion recognition framework that integrates visual, audio, and textual modalities within a unified architecture. The proposed framework integrates an adaptive cross-modal attention module to capture inter-modal interactions. This module dynamically adjusts the contribution of each modality based on contextual relevance, enhancing recognition accuracy. Additionally, an emotion causality inference module uses a fine-tuned, trainable LLaMA2-Chat (7B) model to jointly process image and text data. This identifies word clues associated with the expressed emotions. Furthermore, a real-time emotion feedback module delivers instantaneous assessments of emotional states during conversations, supporting timely and context-aware interventions. The experimental results on four datasets, SEMAINE, AESI, ECF, and MER-2024, demonstrate that our method achieves improvements in F1-scores compared to baselines.
- Research Article
- 10.3390/informatics13030037
- Mar 2, 2026
- Informatics
- Sivachandra K B + 4 more
Causal inference in text data has been a demanding objective in the field of natural language processing, mainly due to the intrinsic ambiguity and context sensitivity inherent in the data, which induce uncertainty. Diminishing this uncertainty is essential for identifying reliable causal connections and advancing predictive consistency. In this research, we introduce an uncertainty-aware ensemble architecture that combines multiple text embedding schemes with both linear and nonlinear classifiers to boost causal text detection. Both sparse and neural embeddings were employed and then combined with an ensemble weighting approach based on two uncertainty estimation techniques, namely entropy-based and KL divergence-based. Unlike conventional ensemble methods with uniform or fixed voting strategies, our approach assigns weights inversely proportional to classifier uncertainty, ensuring that confident models exert greater influence on the final decisions. Our results show that TF-IDF, through its effective word frequency weighting scheme, consistently outperforms other embedding techniques, achieving better performance across both linear and nonlinear classifiers on both datasets (News Corpus and CausalLM–Adjective group). The experimental results show that our uncertainty-aware ensemble approach enhances both calibration and confidence predictions. Entropy-based weighting improves confidence in the case of linear classifiers with accuracy, F1-score, entropy and prediction confidence values of 94.3%, 94.0%, 0.382 and 0.774, respectively, while in the case of nonlinear classifiers the KL divergence-based weighting achieves better performance with an accuracy of 97.6%, F1-score of 97.2%, a KL mean value of around 0.055 and LogLoss of 0.221.
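The "weights inversely proportional to classifier uncertainty" idea above can be made concrete with entropy-based weighting (a sketch of the general technique, not the authors' exact formulation; the class probabilities below are hypothetical):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_weighted_vote(model_probs, eps=1e-6):
    """Weight each classifier inversely to its predictive entropy, then
    average the class distributions with the normalized weights."""
    weights = [1.0 / (entropy(p) + eps) for p in model_probs]
    total = sum(weights)
    weights = [w / total for w in weights]
    n_classes = len(model_probs[0])
    return [sum(w * p[c] for w, p in zip(weights, model_probs))
            for c in range(n_classes)]

# three classifiers scoring [P(causal), P(non-causal)] for one sentence
probs = [[0.95, 0.05],   # confident -> low entropy -> high weight
         [0.60, 0.40],
         [0.50, 0.50]]   # maximally uncertain -> lowest weight
fused = uncertainty_weighted_vote(probs)
```

The confident model dominates the fused distribution, while a uniform averaging would have let the uncertain classifier dilute it.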
- Research Article
- 10.1093/jamia/ocaf230
- Mar 1, 2026
- Journal of the American Medical Informatics Association : JAMIA
- Jiwon You + 1 more
Structural insights into clinical large language models and their barriers to translational readiness.
- Research Article
- 10.1063/5.0310465
- Mar 1, 2026
- AIP Advances
- Rongchang Guo + 1 more
To address the challenges of sample imbalance and insufficient single-granularity feature extraction in railway signal equipment fault diagnosis, this paper proposes a fault diagnosis method based on sample balancing and multi-granularity feature fusion. First, the fault text data undergoes data cleaning followed by key information extraction to obtain a structured fault representation. Based on this representation, the RoBERTa-wwm model is employed to extract global deep semantic features, generating feature vectors. Second, the Borderline SMOTified GAN (BSMOTEG) method is introduced to augment minority-class feature vectors, mitigating class imbalance. Subsequently, the balanced text vectors are fed into a BiLSTM network to capture temporal dependency features, while a multi-head attention mechanism is employed to weight local features. This achieves effective fusion of global, temporal, and local multi-granularity features. Finally, a softmax classifier is employed for signal equipment fault diagnosis. Experimental analysis is conducted using signal equipment failure text data recorded by a railway bureau’s electrical engineering department, with comparisons against other methods. Results demonstrate that the proposed method achieves optimal classification metrics.
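BSMOTEG builds on the SMOTE family of oversampling methods. The underlying interpolation step can be sketched as follows (plain SMOTE-style interpolation only; the paper's method additionally uses borderline-sample selection and a GAN, and the 2-D vectors are hypothetical):

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Generate synthetic minority-class vectors by interpolating between
    a random sample and one of its k nearest same-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random point along the segment x -> nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# hypothetical 2-D feature vectors for a rare fault class
minority = [(0.1, 0.2), (0.15, 0.25), (0.2, 0.1), (0.12, 0.22)]
new_samples = smote_like(minority, n_new=4)
```

Because each synthetic point lies on a segment between two real minority samples, it always stays inside the minority class's convex hull, which is both SMOTE's strength and the limitation that borderline and GAN variants try to overcome.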
- Research Article
- 10.1016/j.ijmedinf.2025.106225
- Mar 1, 2026
- International journal of medical informatics
- Pedro Faustini + 3 more
The digitisation of healthcare has generated vast amounts of data in various formats, including free-text notes, tabular records and medical images. This data is critical for research and innovation, but often contains sensitive information that must be de-identified to ensure patient privacy and regulatory compliance. Natural Language Processing (NLP) enables automated de-identification of sensitive information to safely share medical datasets. This study aims to systematically review the literature on NLP-based de-identification techniques applied to free-text medical reports, tabular data, and burned-in text within medical images over the past decade. It seeks to identify state-of-the-art methods, analyse how de-identification tasks are assessed, and find existing gaps for future research. We systematically searched five important databases (PubMed, Web of Science, DBLP, ACM and IEEE) for articles published from January 2015 to December 2024 (10 years) about de-identification of medical data in free text, tabular data and burned-in pixels in images. We filtered the articles based on their titles and abstracts against inclusion and exclusion criteria, followed by a quality filter. From a set of 734 papers, 83 articles were deemed relevant. Most studies de-identify free text, with a few working with tabular data and a much scarcer number dealing with text embedded in the pixels of the images. De-identification techniques have evolved, with increased use of Language Models and a decline in recurrence-based neural networks. Off-the-shelf tools often require customisation for optimal performance. Most studies de-identify English content, supported by the prevalence of English datasets. Key challenges include the phenomenon of code-mixing (i.e., more than one language used in the same sentence) and the scarcity of available datasets for reproducibility.
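The simplest family of techniques the review covers is rule-based de-identification of free text. A minimal sketch (two illustrative regex rules only; real systems need far broader coverage of names, addresses, and identifiers, usually combined with learned models; the clinical note is invented):

```python
import re

# illustrative patterns; production systems cover many more PHI types
PATTERNS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[PHONE]"),
]

def deidentify(note):
    """Replace matched protected-health-information spans with tags."""
    for pattern, tag in PATTERNS:
        note = pattern.sub(tag, note)
    return note

note = "Seen on 03/14/2024. Call 555-867-5309 to follow up."
clean = deidentify(note)  # "Seen on [DATE]. Call [PHONE] to follow up."
```

Rules like these are precise but brittle, which is one reason the review observes a shift toward language-model-based de-identification.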
- Research Article
- 10.1109/tpami.2025.3630577
- Mar 1, 2026
- IEEE transactions on pattern analysis and machine intelligence
- Yaofang Hu + 3 more
The expansion of textual data, stemming from various sources such as online product reviews and scholarly publications on scientific discoveries, has created a significant demand for the extraction of succinct yet comprehensive information. While many methods have been proposed for automatic keyword extraction in unsupervised and fully supervised settings, effectively leveraging a partial list of known keywords, such as author-specified keywords or Twitter hashtags, remains under-explored. This work aims to enhance both the effectiveness and scalability of semi-supervised keyword extraction. We propose a novel variational Bayesian semi-supervised (VBSS) method that builds upon recent Bayesian advancement in the field, replacing computationally expensive posterior sampling with variational inference and data augmentation. This leads to closed-form updates and substantial speedups, particularly for long texts. Our numerical results show that the VBSS method not only improves performance on longer texts but also offers better control over false discovery rates compared to state-of-the-art keyword extraction techniques.
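A common unsupervised baseline that methods like VBSS are compared against is TF-IDF keyword scoring. A minimal sketch (this is the baseline, not the VBSS method; the corpus is hypothetical):

```python
import math
import re
from collections import Counter

def tfidf_keywords(docs, doc_index, top_n=3):
    """Rank a document's words by term frequency x inverse document
    frequency, a standard unsupervised keyword-extraction baseline."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    n_docs = len(docs)
    df = Counter(w for toks in tokenized for w in set(toks))
    target = tokenized[doc_index]
    tf = Counter(target)
    scores = {w: (c / len(target)) * math.log(n_docs / df[w])
              for w, c in tf.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]

docs = ["graph neural networks learn graph node embeddings",
        "the reviewers praised the battery life of the laptop",
        "transformer models dominate language benchmarks"]
keywords = tfidf_keywords(docs, 0)
```

Unlike this baseline, a semi-supervised method can pull a partial list of known keywords (author keywords, hashtags) into the scoring, which is the setting the paper targets.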