Online monitoring of unstructured data in additive manufacturing via an improved manifold learning algorithm

Abstract

In today’s digital era of rapid technological advancement, data volume continues to surge and data structures have grown increasingly diverse. Advanced manufacturing processes, particularly additive manufacturing (AM), often generate unstructured spatial-signal point-set data: collections of spatial coordinates with associated sensor signals that are dense and rich in fine-grained information. At the same time, online monitoring in customized production settings often suffers from limited historical data and lacks effective real-time monitoring methods. To address these challenges, we propose Fine-Grained Point-set Distance-based t-distributed stochastic neighbor embedding (FGPDist-t-SNE), which integrates a novel point-set distance metric, FGPDist, with an enhanced t-SNE manifold learning framework. The method preserves as much of the fine-grained information in complex unstructured spatial-signal point-set data as possible, enabling effective feature extraction. Under limited training data, it also supports efficient online updates for timely anomaly detection. Taking AM as a representative application scenario, we validate the approach through both simulation and real-case experiments. Comparative analysis against traditional benchmarks demonstrates the superiority of FGPDist-t-SNE for online quality monitoring of unstructured point-set data.
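The abstract does not specify FGPDist itself, so as a hedged illustration the sketch below substitutes a symmetric Chamfer distance between point sets and feeds the resulting pairwise distance matrix into scikit-learn's t-SNE via `metric="precomputed"`. The toy point sets and all parameter choices (perplexity, sizes) are assumptions for demonstration only, not the paper's method.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.manifold import TSNE

def chamfer_distance(a, b):
    # Symmetric Chamfer distance between two point sets (rows = points).
    # Stand-in for the paper's FGPDist, which is not given in the abstract.
    d = cdist(a, b)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

rng = np.random.default_rng(0)
# 12 toy "spatial-signal" point sets: (x, y, signal) rows, varying sizes
point_sets = [rng.normal(size=(int(rng.integers(30, 60)), 3)) for _ in range(12)]

# Pairwise point-set distance matrix (symmetric, zero diagonal)
n = len(point_sets)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = chamfer_distance(point_sets[i], point_sets[j])

# t-SNE on the precomputed point-set distances; init="random" is required
# when metric="precomputed"
emb = TSNE(n_components=2, metric="precomputed", init="random",
           perplexity=5, random_state=0).fit_transform(D)
print(emb.shape)  # one 2-D embedding point per point set
```

In this layout each manufacturing layer or scan becomes one sample in the embedding, so an incoming point set can be compared against the low-dimensional neighborhood of normal builds for anomaly detection.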

Similar Papers
  • Discussion
  • Citations: 24
  • 10.1161/circoutcomes.115.002125
Natural Language Processing and the Promise of Big Data: Small Step Forward, but Many Miles to Go.
  • Aug 18, 2015
  • Circulation: Cardiovascular Quality and Outcomes
  • Thomas M Maddox + 1 more

The promise of big data has captured healthcare’s imagination. Although the term lacks a consensus definition, it generally refers to electronic health data sets characterized by the 3 Vs: volume, variety, and velocity.1,2 Volume refers to the sheer amount of healthcare data currently generated by clinical operations, administration, and patients themselves. By one estimate, ≈25 000 petabytes of healthcare data will be available by 2020—an amount that could fill 500 billion file cabinets.2 Variety refers to the wide range of healthcare data formats. For example, electronic health records (EHRs) contain both structured and unstructured (or free-text) data, diagnostic images come in a variety of multimedia formats, and patient data are generated from wearables, mobile devices, medical devices, and social media—each with its own format. Velocity refers to the rapidity with which new data are generated, and thus the speed at which it needs incorporation into data sets and analyses to provide real-time insights into health care. Article see p 477 The potential of such data is enormous. Insights from big data could fuel innovation and improvement in clinical operations, research and development, and public health.1 However, the potential of big data to realize these lofty aspirations is matched by the challenge of organizing, analyzing, and generating actionable insights from it. One of the biggest challenges in realizing the potential of big data is in abstracting it. With the passage of the HITECH (The Health Information Technology for Economic and Clinical Health) Act in 2009, the adoption of EHRs in clinical practice has accelerated, and now over half of office-based practices and hospitals are using some form of EHR.3,4 As a result, more point-of-care clinical data, previously inaccessible in its paper format, is potentially available. However, the variety aspect of EHR data—its mix …

  • Research Article
  • Citations: 15
  • 10.3390/agriculture10010021
Forecasts of the Amount Purchase Pork Meat by Using Structured and Unstructured Big Data
  • Jan 18, 2020
  • Agriculture
  • Ga-Ae Ryu + 3 more

It is believed that the huge amount of information delivered to the consumers through mass media, including television and social networks, may affect consumers’ behavior. The purpose of this study was to forecast the amount required to purchase pork belly meat by using unstructured data such as broadcast news, TV programs/shows and social network as well as structured data such as consumer panel data, retail and wholesale prices and production outputs in order to prove that mass media data release can occur ahead of actual economic activities and consumer behavior can be predicted by using these data. By using structured and unstructured data from 2010 to 2016 and five forecasting algorithms (autoregressive exogenous model and vector error correction model for time series, gradient boosting and random forest for machine learning, and long short-term memory for recurrent neural network), the amounts required to purchase pork belly meat in 2017 were forecasted and compared with the actual amounts to validate model accuracy. Our findings suggest that when unstructured data were combined with structured data, the forecast pattern is improved. To date, our study is the first report that forecasts the demand of pork meat by using structured and unstructured data.

  • Research Article
  • Citations: 55
  • 10.1111/ajt.14099
Big Data, Predictive Analytics, and Quality Improvement in Kidney Transplantation: A Proof of Concept.
  • Jan 4, 2017
  • American Journal of Transplantation
  • T.R Srinivas + 9 more


  • Conference Article
  • Citations: 19
  • 10.1109/ctceec.2017.8454999
Structured and Unstructured Big Data Analytics
  • Sep 1, 2017
  • Suyash Mishra + 1 more

The volume of data in the world is growing very fast, generated from a variety of sources such as social media, sensors, the airline industry, and scientific research, in many different formats. The biggest challenge is inferring meaningful insights from such varied big data while also managing the storage of fast-growing data. The size of the databases used in today’s enterprises grows at an exponential rate day by day, so industries’ need to quickly process and analyze big data volumes for business decision making and customer insight has also grown exponentially. Data pouring in from various sources may be structured or unstructured in nature. Structured data refers to relatively well-organized information that can be inserted into a traditional RDBMS, where simple, straightforward search algorithms or SQL queries make retrieval efficient and easy. In contrast, unstructured data is information that does not come in a pre-defined data format or well-organized storage model and cannot be stored well in relational tables. It is considered the fastest-growing type of data, e.g., images, sensor data, web chats, social networking messages, video, documents, log files, and email. Many techniques and software tools are available that can process and efficiently store unstructured data and help organizations perform analytics on it. The variety and unordered nature of such data makes storage and processing a time- and resource-consuming affair. Technological advances have opened the floodgates to huge volumes of unstructured data. Multimedia data, which spans the entire Internet, is one example of unstructured big data and requires high execution capability to extract useful information.
Rapid processing of multimedia data such as video is important for criminal investigations, surveillance monitoring, news analysis, sports analytics, emotion extraction, and more, so analyzing multimedia data within a minimal timeframe is one of the latest research areas. We have therefore researched techniques for analyzing unstructured data to extract the meaningful information hidden in big data. In addition, we describe various techniques and software used to manage and process unstructured big data efficiently and to improve the performance of complex analyses.

  • Research Article
  • Citations: 3
  • 10.2196/66910
Using Structured Codes and Free-Text Notes to Measure Information Complementarity in Electronic Health Records: Feasibility and Validation Study.
  • Feb 13, 2025
  • Journal of medical Internet research
  • Tom M Seinen + 3 more

Electronic health records (EHRs) consist of both structured data (eg, diagnostic codes) and unstructured data (eg, clinical notes). It is commonly believed that unstructured clinical narratives provide more comprehensive information. However, this assumption lacks large-scale validation and direct validation methods. This study aims to quantitatively compare the information in structured and unstructured EHR data and directly validate whether unstructured data offers more extensive information across a patient population. We analyzed both structured and unstructured data from patient records and visits in a large Dutch primary care EHR database between January 2021 and January 2024. Clinical concepts were identified from free-text notes using an extraction framework tailored for Dutch and compared with concepts from structured data. Concept embeddings were generated to measure semantic similarity between structured and extracted concepts through cosine similarity. A similarity threshold was systematically determined via annotated matches and minimized weighted Gini impurity. We then quantified the concept overlap between structured and unstructured data across various concept domains and patient populations. In a population of 1.8 million patients, only 13% of extracted concepts from patient records and 7% from individual visits had similar structured counterparts. Conversely, 42% of structured concepts in records and 25% in visits had similar matches in unstructured data. Condition concepts had the highest overlap, followed by measurements and drug concepts. Subpopulation visits, such as those with chronic conditions or psychological disorders, showed different proportions of data overlap, indicating varied reliance on structured versus unstructured data across clinical contexts. 
Our study demonstrates the feasibility of quantifying the information difference between structured and unstructured data, showing that the unstructured data provides important additional information in the studied database and populations. The annotated concept matches are made publicly available for the clinical natural language processing community. Despite some limitations, our proposed methodology proves versatile, and its application can lead to more robust and insightful observational clinical research.

  • Research Article
  • Citations: 66
  • 10.1097/pec.0000000000000484
An Introduction to Natural Language Processing: How You Can Get More From Those Electronic Notes You Are Generating.
  • Jul 1, 2015
  • Pediatric Emergency Care
  • Amir A Kimia + 3 more

Electronically stored clinical documents may contain both structured data and unstructured data. The use of structured clinical data varies by facility, but clinicians are familiar with coded data such as International Classification of Diseases, Ninth Revision, Systematized Nomenclature of Medicine-Clinical Terms codes, and commonly other data including patient chief complaints or laboratory results. Most electronic health records have much more clinical information stored as unstructured data, for example, clinical narrative such as history of present illness, procedure notes, and clinical decision making are stored as unstructured data. Despite the importance of this information, electronic capture or retrieval of unstructured clinical data has been challenging. The field of natural language processing (NLP) is undergoing rapid development, and existing tools can be successfully used for quality improvement, research, healthcare coding, and even billing compliance. In this brief review, we provide examples of successful uses of NLP using emergency medicine physician visit notes for various projects and the challenges of retrieving specific data and finally present practical methods that can run on a standard personal computer as well as high-end state-of-the-art funded processes run by leading NLP informatics researchers.

  • Research Article
  • Citations: 8
  • 10.1371/journal.pone.0289795
Finding the best trade-off between performance and interpretability in predicting hospital length of stay using structured and unstructured data.
  • Nov 30, 2023
  • PloS one
  • Franck Jaotombo + 3 more

This study aims to develop high-performing Machine Learning and Deep Learning models in predicting hospital length of stay (LOS) while enhancing interpretability. We compare performance and interpretability of models trained only on structured tabular data with models trained only on unstructured clinical text data, and on mixed data. The structured data was used to train fourteen classical Machine Learning models including advanced ensemble trees, neural networks and k-nearest neighbors. The unstructured data was used to fine-tune a pre-trained Bio Clinical BERT Transformer Deep Learning model. The structured and unstructured data were then merged into a tabular dataset after vectorization of the clinical text and a dimensional reduction through Latent Dirichlet Allocation. The study used the free and publicly available Medical Information Mart for Intensive Care (MIMIC) III database, on the open AutoML Library AutoGluon. Performance is evaluated with respect to two types of random classifiers, used as baselines. The best model from structured data demonstrates high performance (ROC AUC = 0.944, PRC AUC = 0.655) with limited interpretability, where the most important predictors of prolonged LOS are the level of blood urea nitrogen and of platelets. The Transformer model displays a good but lower performance (ROC AUC = 0.842, PRC AUC = 0.375) with a richer array of interpretability by providing more specific in-hospital factors including procedures, conditions, and medical history. The best model trained on mixed data satisfies both a high level of performance (ROC AUC = 0.963, PRC AUC = 0.746) and a much larger scope in interpretability including pathologies of the intestine, the colon, and the blood; infectious diseases, respiratory problems, procedures involving sedation and intubation, and vascular surgery. Our results outperform most of the state-of-the-art models in LOS prediction both in terms of performance and of interpretability. 
Data fusion between structured and unstructured text data may significantly improve performance and interpretability.

  • Conference Article
  • Citations: 13
  • 10.1109/bibm49941.2020.9312987
Pneumonia Outcome Prediction Using Structured And Unstructured Data From EHR
  • Dec 16, 2020
  • Cherubin Mugisha + 1 more

In the Intensive Care Unit (ICU), it is important to anticipate interventions for patients at high risk of death. This requires identifying those patients, ideally at the time of their ICU admission, and updating their initial risk estimate every time new data become available. This predictive task can be performed by analyzing both structured and unstructured routine data so that a prediction can be initiated for every patient. Traditional statistical tools have been used to assess diseases like pneumonia and predict patient outcomes; more recently, machine learning models have emerged and shown better performance on such tasks. Although various results have been published, prior works rely on a single data type, either structured or unstructured. Using the Medical Information Mart for Intensive Care dataset, we propose an ensemble model that aggregates different data types to predict the outcome of a pneumonia patient admitted to the ICU, using the limited data available at the very early stage of their stay. To demonstrate the importance of this approach, we compared it with two other models, one based only on structured data and another based on narrative text from caregivers, and showed that our ensemble model performs considerably better, with an F1-score of 0.98 (0.97 MCC), while the structured-data-only model achieved an F1-score of 0.79 and the text notes alone predicted the outcome with a Matthews Correlation Coefficient of 0.89. In addition to showing how ensemble learning can outperform other models on this task, we demonstrated the importance and usefulness of interpreting the predictions by pointing out the leading factors that determine the global and individual outcome predictions.

  • Research Article
  • 10.58776/ijitcsa.v3i3.226
Integrating Structured and Unstructured Data for Enhanced Marketing Intelligence through Text Mining and Business Analytics
  • Dec 18, 2025
  • International Journal of Information Technology and Computer Science Applications
  • Sabreen Hashim Salman

In the digital era, the rapid growth of social media and online platforms has led to an explosion of unstructured textual data that holds significant business value. Traditional marketing strategies, once reliant on structured data such as demographics and purchase history, now benefit from insights derived from text analytics and sentiment analysis. This paper explores the integration of structured and unstructured data to strengthen marketing intelligence and customer segmentation. By utilizing text mining techniques and Natural Language Processing (NLP), unstructured data such as customer reviews and comments can be analyzed to extract sentiments, identify emerging trends, and refine customer relationship strategies. The study proposes an integrated framework that combines data extraction, transformation, and loading (ETL) processes with a data warehouse system for unified analysis. Using clustering algorithms such as K-Means and visualization tools, insights into customer behavior, preferences, and market segmentation are revealed. The paper also discusses the challenges of handling multilingual and context-dependent text, ethical and privacy considerations, and the technical architecture necessary for business intelligence implementation. Findings suggest that effective integration of textual analytics with structured data can lead to more informed decision-making, improved marketing strategies, and stronger customer engagement.
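The segmentation step this abstract describes, clustering customer text with K-Means, can be sketched with TF-IDF features in scikit-learn. The toy reviews, the choice of TF-IDF, and the cluster count are illustrative assumptions, not the paper's ETL/data-warehouse pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical customer reviews standing in for the paper's social-media data
reviews = [
    "fast delivery and great price",
    "cheap price, shipping was quick",
    "terrible customer service, no reply",
    "support never answered my emails",
    "love the product quality, well made",
    "excellent build quality, very durable",
]

# Vectorize the unstructured text, then cluster into candidate segments
X = TfidfVectorizer(stop_words="english").fit_transform(reviews)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)  # one cluster label per review
```

In a full pipeline the cluster labels would be joined back onto structured records (demographics, purchase history) in the warehouse, giving each customer segment both behavioral and textual features.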

  • Research Article
  • Citations: 24
  • 10.3390/ijerph20054340
Integrating Structured and Unstructured EHR Data for Predicting Mortality by Machine Learning and Latent Dirichlet Allocation Method
  • Feb 28, 2023
  • International Journal of Environmental Research and Public Health
  • Chih-Chou Chiu + 5 more

An ICU is a critical care unit that provides advanced medical support and continuous monitoring for patients with severe illnesses or injuries. Predicting the mortality rate of ICU patients can not only improve patient outcomes, but also optimize resource allocation. Many studies have attempted to create scoring systems and models that predict the mortality of ICU patients using large amounts of structured clinical data. However, unstructured clinical data recorded during patient admission, such as notes made by physicians, is often overlooked. This study used the MIMIC-III database to predict mortality in ICU patients. In the first part of the study, only eight structured variables were used, including the six basic vital signs, the GCS, and the patient’s age at admission. In the second part, unstructured predictor variables were extracted from the initial diagnosis made by physicians when the patients were admitted to the hospital and analyzed using Latent Dirichlet Allocation techniques. The structured and unstructured data were combined using machine learning methods to create a mortality risk prediction model for ICU patients. The results showed that combining structured and unstructured data improved the accuracy of the prediction of clinical outcomes in ICU patients over time. The model achieved an AUROC of 0.88, indicating accurate prediction of patient vital status. Additionally, the model was able to predict patient clinical outcomes over time, successfully identifying important variables. This study demonstrated that a small number of easily collectible structured variables, combined with unstructured data and analyzed using LDA topic modeling, can significantly improve the predictive performance of a mortality risk prediction model for ICU patients. These results suggest that initial clinical observations and diagnoses of ICU patients contain valuable information that can aid ICU medical and nursing staff in making important clinical decisions.
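The fusion step described here, LDA topics from admission notes concatenated with a handful of structured variables, can be sketched as follows. The toy notes, labels, and structured features (age, heart rate) are hypothetical stand-ins for the MIMIC-III variables, and the tiny sample size is for illustration only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical admission notes and outcomes (1 = died, 0 = survived)
notes = [
    "severe pneumonia low oxygen intubated",
    "chest infection oxygen support required",
    "mild cough discharged home stable",
    "stable vitals routine observation discharged",
    "sepsis fever intubated ventilator",
    "routine checkup stable no fever",
    "respiratory failure ventilator oxygen",
    "minor symptoms stable observation",
]
labels = np.array([1, 0, 0, 0, 1, 0, 1, 0])
# Hypothetical structured variables: [age, heart rate]
structured = np.array([[78, 110], [65, 95], [54, 80], [60, 78],
                       [81, 120], [49, 72], [74, 115], [58, 76]], dtype=float)

# LDA turns each note into a topic-proportion vector
counts = CountVectorizer().fit_transform(notes)
topics = LatentDirichletAllocation(n_components=2,
                                   random_state=0).fit_transform(counts)

# Concatenate structured variables with topic features, then classify
X = np.hstack([structured, topics])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X[:2]))
```

Each row of `topics` sums to 1, so the classifier sees the note as a soft mixture of themes alongside the vital-sign features, which is the combination the study credits for the improved AUROC.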

  • Book Chapter
  • Citations: 1
  • 10.1007/978-981-19-0284-0_6
Framework for Context-Based Intelligent Search Engine for Structured and Unstructured Data
  • Jan 1, 2022
  • Ritendra R Sawale + 2 more

In today’s time, determining the user’s exact need from search queries is a significant challenge, as a tremendous amount of structured and unstructured data is produced daily. It is easy to extract what we need from structured data compared with unstructured data, where the semantics of the text must be understood. Natural language processing helps extract useful information from such unstructured textual data, and word embedding is one way to address this issue. The implemented system aims to build a framework that searches based on the context hidden in the user query. As the context of keywords plays a vital role in extracting relevant search results from the database, the implemented system determines the context of the keywords in the query using the GloVe word embedding technique. The embedded query is used to find the most relevant documents from the database, which consists of text documents in different formats such as PDF, Word files, Excel sheets, and online crawled data, all stored in an ElasticSearch database. The proposed system can be used as an intranet search system that returns the most relevant data quickly. Existing entity-based search engines lack the contextual capability provided by the implemented system; results for search queries are based on a combination of entity-based and context-based search.
Keywords: Natural language processing; Artificial intelligence; Machine learning; ElasticSearch

  • Research Article
  • 10.1002/alz.090057
Pursuing dementia detection for better prevalence estimation with artificial intelligence and traditional modelling techniques in electronic health data
  • Dec 1, 2024
  • Alzheimer's & Dementia
  • Taya Collyer + 16 more

Background: Population dementia prevalence is traditionally estimated using cohort studies, surveys, routinely collected administrative data, and registries. Hospital Electronic Health Records (EHRs) comprise rich structured and unstructured (text) clinical data that are underutilised for this purpose. We aimed to develop a suite of algorithms using routinely collected EHR data to reliably identify cases of dementia, as a key step towards incorporating such data in prevalence estimation. Towards this, we developed a novel predictive framework integrating data-science and biostatistical methods.
Method: Training data were sourced via the National Centre for Healthy Ageing (NCHA) Data Platform, a linked, curated, EHR-derived data warehouse. Individuals within the platform catchment aged >60 years with confirmed dementia were identified through hospital specialist dementia clinics. A comparison group of individuals aged >60 years with EHR records without dementia was recruited from the community. A panel of clinical experts (Neurology, Geriatric Medicine) informed variable and concept selection and guided data cleaning within both streams. Algorithms were developed via two work-streams: a traditional biostatistical approach fitting logistic regression models to structured data elements, and a data-science stream using Natural Language Processing (NLP) to fit models to the unstructured (text) parts of the EHR for the same individuals.
Result: Of 568 individuals (362 with dementia), 434 had clinical notes available. In the data-science stream using unstructured data, among a range of NLP-derived models, the Random Forest classifier performed best in assigning dementia status, with Area Under the Curve (AUC) 0.95, specificity 90.2%, and sensitivity 88.4%. In the biostatistics stream, 15 structured variables were included in the final model, covering demographics, health service attendance, medications, and ICD-10 codes, with AUC 0.94, specificity 85.9%, and sensitivity 85.6%.
Conclusion: Artificial intelligence techniques applied to unstructured electronic health data and guided by human clinical expertise may be powerful tools for capturing the presence of dementia, at least comparable to traditional techniques using structured data, and confer practical and scientific advantages for dementia prevalence estimation. Future validation is required in less crisply delineated real-world settings.

  • Research Article
  • Citations: 119
  • 10.1111/jgs.15411
The Value of Unstructured Electronic Health Record Data in Geriatric Syndrome Case Identification.
  • Jul 4, 2018
  • Journal of the American Geriatrics Society
  • Hadi Kharrazi + 7 more

To examine the value of unstructured electronic health record (EHR) data (free-text notes) in identifying a set of geriatric syndromes. Retrospective analysis of unstructured EHR notes using a natural language processing (NLP) algorithm. Large multispecialty group. Older adults (N=18,341; average age 75.9, 58.9% female). We compared the number of geriatric syndrome cases identified using structured claims and structured and unstructured EHR data. We also calculated these rates using a population-level claims database as a reference and identified comparable epidemiological rates in peer-reviewed literature as a benchmark. Using insurance claims data resulted in a geriatric syndrome prevalence ranging from 0.03% for lack of social support to 8.3% for walking difficulty. Using structured EHR data resulted in similar prevalence rates, ranging from 0.03% for malnutrition to 7.85% for walking difficulty. Incorporating unstructured EHR notes, enabled by applying the NLP algorithm, identified considerably higher rates of geriatric syndromes: absence of fecal control (2.1%, 2.3 times as much as structured claims and EHR data combined), decubitus ulcer (1.4%, 1.7 times as much), dementia (6.7%, 1.5 times as much), falls (23.6%, 3.2 times as much), malnutrition (2.5%, 18.0 times as much), lack of social support (29.8%, 455.9 times as much), urinary retention (4.2%, 3.9 times as much), vision impairment (6.2%, 7.4 times as much), weight loss (19.2%, 2.9 times as much), and walking difficulty (36.34%, 3.4 times as much). The geriatric syndrome rates extracted from structured data were substantially lower than published epidemiological rates, although adding the NLP results considerably closed this gap. Claims and structured EHR data give an incomplete picture of burden related to geriatric syndromes. Geriatric syndromes are likely to be missed if unstructured data are not analyzed.
Pragmatic NLP algorithms can assist with identifying individuals at high risk of experiencing geriatric syndromes and improving coordination of care for older adults.

  • Conference Article
  • 10.1109/icecet52533.2021.9698504
Comparison of Structured and Free-text Based Features for Rehospitalization Prediction among Patients with Severe Mental Illness
  • Dec 9, 2021
  • Yan Cheng + 3 more

Background: Clinical risk prediction helps identify patients at high risk of poor outcomes. The creation of predictive models relies heavily on structured data, and the benefit of adding free-text features to improve risk prediction is not well understood. Objectives: We aimed to predict 30-day and 3-year rehospitalization risk among patients with severe mental illness and to compare the predictive performance of structured data alone, unstructured data alone, and combined structured and unstructured data. Methods: Veterans with ≥2 diagnoses of serious mental illness, including schizophrenia (SZ), schizoaffective disorder (SAD), bipolar disorder (BP), and major depressive disorder (MDD), were sampled from the Veterans Health Administration (VHA) databases. Topic modeling methods were used to process unstructured data and identify stable topics. Correlation tests and automatic stepwise models were used to select features. Results: Among 139,830 patients with severe mental illness, 76.2% were diagnosed with MDD, 23.4% with BP, 8.3% with SZ, and 3.5% with SAD. For both the 30-day and 3-year rehospitalization outcomes, patients with the following characteristics were at higher risk of hospitalization: male, homeless, having a condition of SAD, primarily diagnosed with and treated for alcohol withdrawal, and diagnosed with cocaine abuse. In addition, patients whose notes contained topics related to homelessness, suicide, and substance abuse were also at higher risk. In all models, prediction accuracy at the individual level, measured by the c-statistic, was in the range of 0.6-0.7, regardless of whether features came only from structured data, only from unstructured data, or from both. Conclusion: Topics identified from unstructured data contributed to predicting short-term and long-term rehospitalization, and their importance was as good as that of features identified from structured data such as demographics, ICD codes, CPT codes, and medications.

  • Book Chapter
  • 10.1016/b978-0-12-816916-2.00003-6
Chapter 1.3 - The “Great Divide”
  • Jan 1, 2019
  • Data Architecture
  • W.H Inmon + 2 more

