Unstructured Text Data Research Articles

We introduce our recently developed systems for imaging herbarium specimens and for automatically extracting data from specimen labels at the Herbarium of the Museum of Nature and Human Activities, Hyogo, Japan (HYO). First, we designed a low-cost but high-quality specimen imaging system that lets non-professional photographers obtain images rapidly (Takano et al. 2019). The system uses a mass-produced mirrorless interchangeable-lens camera (Sony ILCE-6300) with a 35 mm F/2.8 lens (Samyang Optics SYIO35AF-E). We built the photo stand ourselves to reduce costs, and we adopted an LED (light-emitting diode) lighting system with high color rendering. This imaging system has been introduced, with some improvements or adjustments for the available space, to various herbaria in Japan (e.g., University of Tokyo (TI), Kyoto University (KYO)), contributing to the digitization of herbarium specimens across Japan.

Next, we developed a system to extract label information from specimen images. Initially, the whole specimen image was uploaded to Google OCR (optical character recognition) and the data were extracted as text. Uploading the whole specimen image decreased reading accuracy because the plant itself acted as OCR noise. Therefore, the label was cropped from the specimen image using D-Lib*1 and passed to Tesseract OCR*2 to extract the label information (Aoki 2019, Takano et al. 2020). When installing this system at HYO, we designed it as an application accessible externally via the internet, which proved very useful during the coronavirus pandemic: part-time workers checked and entered label data from home.

Finally, we set out to develop a system that automatically labels the text extracted by OCR and enters it into the appropriate cells of the database, because even after the text had been extracted from specimen images, a human still had to enter it into the database. We therefore adopted Named Entity Recognition (NER), a technique that identifies named entities, such as place names and other proper nouns, in unstructured text. It enables information recorded on herbarium specimen labels to be tagged as named entities. We tried text matching at first, but the results were not satisfactory, so we turned to machine learning. We compared three natural language processing options for Japanese: BERT (Bidirectional Encoder Representations from Transformers), ALBERT (A Lite BERT), and spaCy. Although BERT and spaCy returned similarly high F-scores (indicating good performance), we chose spaCy because it runs better on ordinary PCs and servers. After creating a text corpus (a specialised dataset) specific to herbarium specimen labels and training the model sufficiently on it, we successfully developed the application. The project files are available on GitHub*3 (Takano et al. 2024). We then examined whether the system could be applied to images of non-plant specimens, such as fishes and birds, and found that it could extract data efficiently. We therefore decided to make the system publicly available on a cloud server and share it with other natural history museums in Japan*4. Curators can obtain a unique ID and password and upload specimen images from their collections to extract label data. The digitization of natural history collections in Japan has long lagged behind that of other countries, and this system will help to accelerate it. The system is specialized for the natural history collections of Japan, but we believe similar programs can be built in other countries, and we hope our experience will contribute to the mobilization of the world's natural history collections.
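
The last step described above, placing each recognized entity in the right database cell, can be illustrated with a minimal sketch. It is not the authors' implementation: the entity labels (`TAXON`, `LOC`, `DATE`, `PERSON`), the column names, and the sample label text are all hypothetical. It assumes the NER step has already produced `(text, label)` pairs; with spaCy these could be gathered as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# Hypothetical mapping from NER labels to database columns (not from the paper).
LABEL_TO_COLUMN = {
    "TAXON": "scientific_name",
    "LOC": "locality",
    "DATE": "collection_date",
    "PERSON": "collector",
}


def entities_to_record(entities):
    """Group (text, label) pairs from an NER step into one database row.

    Entities whose label has no known column are kept under "unassigned"
    so that a human can review them instead of losing them silently.
    """
    grouped = defaultdict(list)
    for text, label in entities:
        column = LABEL_TO_COLUMN.get(label, "unassigned")
        grouped[column].append(text)
    # Join repeated values (e.g. two locality fragments) into a single cell.
    return {column: "; ".join(values) for column, values in grouped.items()}


# Made-up entities standing in for the NER output of one specimen label.
sample_entities = [
    ("Acer palmatum", "TAXON"),
    ("Hyogo Pref.", "LOC"),
    ("Sanda City", "LOC"),
    ("12 May 1998", "DATE"),
    ("A. Collector", "PERSON"),
    ("No. 1234", "SPECIMEN_NO"),
]

record = entities_to_record(sample_entities)
for column, value in record.items():
    print(f"{column}: {value}")
```

Keeping unmatched entities in a review column rather than dropping them fits the workflow in the abstract, where staff still check the extracted label data.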

Background: The increasing rate of intensive care unit (ICU) readmissions poses significant challenges in healthcare, affecting both costs and patient outcomes. Predicting patient readmission after discharge is crucial for improving the quality of care and reducing expenses. Traditional analyses of electronic health record (EHR) data have focused primarily on numerical data, often neglecting valuable text data.

Methods: This study employs a hybrid model combining BERTopic and Long Short-Term Memory (LSTM) networks to predict ICU readmissions. Using the MIMIC-III database, we draw on both quantitative and text data to improve prediction. Our approach integrates unsupervised topic modeling with supervised deep learning, extracting latent topics from patient records and transforming discharge summaries into topic vectors for more interpretable and personalized predictions.

Results: On a dataset of 36,232 ICU patient records, our model achieved an AUROC of 0.80, surpassing traditional machine learning models. BERTopic made the unstructured data usable, generating themes that guide the selection of relevant predictive factors for readmission prognosis. This enhanced both the interpretability and the predictive capability of the model. Additionally, integrating importance ranking methods into our machine learning framework allowed an in-depth analysis of the significance of individual variables, providing insight into how different input variables interact and affect readmission predictions across clinical contexts.

Conclusions: The practical application of BERTopic in our hybrid model contributes to more efficient patient management and serves as a valuable tool for developing tailored treatment strategies and optimizing resources. This study highlights the significance of integrating unstructured text data with traditional quantitative data to develop more accurate and interpretable predictive models in healthcare, emphasizing the importance of individualized care and cost-effective healthcare paradigms.
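
To make the reported AUROC of 0.80 concrete, the following minimal sketch computes AUROC from its pairwise definition: the probability that a randomly chosen readmitted patient receives a higher predicted score than a randomly chosen non-readmitted patient, with ties counted as one half. The function name and the toy labels and scores are illustrative assumptions, not data or code from the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 outcomes (1 = readmitted).
    scores: predicted risk scores in the same order.
    Returns the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties contribute 0.5.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 2 readmitted and 2 non-readmitted patients (made-up scores).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores))  # 3 of 4 pairs ranked correctly -> 0.75
```

Under this reading, an AUROC of 0.80 means the model ranks a readmitted patient above a non-readmitted one in about 80% of such pairs. The quadratic loop is chosen for clarity; a rank-based formulation would suit datasets of the size used in the study.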

Articles published on Unstructured Text Data

Deriving comprehensive literature trends on multi-omics analysis studies in autism spectrum disorder using literature mining pipeline

Leveraging LLMs for Unstructured Direct Elicitation of Decision Rules

An Extended Pattern Based Comprehensive Stemmer for the Urdu Language

Exploring Natural Language Processing through an Exemplar Using YouTube.

Understanding critical service factors in neobanks: crafting strategies through text mining

Comparing public health-related material in print and web page versions of legacy media.

Development of an Automated Label Data Entry System from Herbarium Specimen Images at Hyogo Herbarium (HYO)

Predicting ICU Readmission from Electronic Health Records via BERTopic with Long Short Term Memory Network Approach.

Real-Time Extraction of News Events Based on BERT Model

A Review of Sentiment Analysis in Social Media Perspectives

Visualizing Nursing Narratives: An Evaluation of Latent Dirichlet Allocation Topic Modeling for Care Reports.

Manufacturing time estimation for offer pricing: A machine learning application in a French metallurgy industry

Multiclass Text Classification of Climate Change Reports: Insights into Natural Disasters, Impacts, Locations, and Time Periods

MedNER: A Service-Oriented Framework for Chinese Medical Named-Entity Recognition with Real-World Application

Large language models overcome the challenges of unstructured text data in ecology

ADVANCED TECHNIQUES IN MULTI-LABEL TEXT CLASSIFICATION: INTEGRATION OF BETA ANT COLONY AND DEEP LEARNING APPROACHES

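Deep Learning-Based Sentiment Analysis for Attribute-Level Evaluation of Incheon Tourist Attractions (인천 관광지의 속성별 평가를 위한 딥러닝 기반 감성분석)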

Multi-Source Information Graph Embedding with Ensemble Learning for Link Prediction

Crafting clarity: Leveraging large language models to decode consumer reviews

Advancing multimodal diagnostics: Integrating industrial textual data and domain knowledge with large language models
