Manual Data Annotation Research Articles

Background. Secondary use of routine medical data is key to large-scale clinical and health services research. In a maximum care hospital, the volume of data generated every day exceeds the limits of big data. These so-called "real-world data" are essential to complement knowledge and results from clinical trials. Furthermore, big data may help in establishing precision medicine. However, manual data extraction and annotation workflows for turning routine data into research data would be complex and inefficient. Generally, best practices for managing research data focus on data output rather than on the entire data journey from primary sources to analysis. Many hurdles have to be overcome before routinely collected data become usable and available for research. In this work, we present the implementation of an automated framework for the timely processing of clinical care data, including free texts and genetic data (non-structured data), and their centralized storage as Findable, Accessible, Interoperable, Reusable (FAIR) research data in a maximum care university hospital. Methods. We identify the data processing workflows necessary to operate a medical research data service unit in a maximum care hospital. We decompose structurally equal tasks into elementary sub-processes and propose a framework for general data processing. We base our processes on open-source software components and, where necessary, custom-built generic tools. Results. We demonstrate the application of our proposed framework in practice by describing its use in our Medical Data Integration Center (MeDIC). Our microservices-based and fully open-source data processing automation framework incorporates a complete recording of data management and manipulation activities. The prototype implementation also includes a metadata schema for data provenance and a process validation concept. All requirements of a MeDIC are orchestrated within the proposed framework: data input from many heterogeneous sources, pseudonymization and harmonization, integration in a data warehouse, and finally options for extracting or aggregating data for research purposes in accordance with data protection requirements. Conclusion. Though the framework is not a panacea for bringing routine-based research data into compliance with FAIR principles, it provides a much-needed way to process data in a fully automated, traceable, and reproducible manner.
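
The idea of chaining processing steps while recording every manipulation can be illustrated with a minimal Python sketch. It is not the MeDIC implementation: the function names, the field mapping, the salt, and the record layout are all hypothetical, and the salted hash only stands in for a real pseudonymization service.

```python
import hashlib
import json


def digest(record):
    """Short, deterministic fingerprint of a record for the provenance log."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def pseudonymize(record, salt="demo-salt"):
    """Replace the direct identifier with a salted hash (illustrative only)."""
    out = dict(record)
    patient_id = out.pop("patient_id")
    token = hashlib.sha256((salt + patient_id).encode("utf-8")).hexdigest()
    out["pseudonym"] = token[:16]
    return out


# Hypothetical mapping from a local field name to a harmonized one.
FIELD_MAP = {"hb": "hemoglobin_g_dl"}


def harmonize(record):
    """Rename local fields to their harmonized names; keep the rest as-is."""
    return {FIELD_MAP.get(key, key): value for key, value in record.items()}


def run_pipeline(record, steps):
    """Apply each step in order and log input/output fingerprints per step."""
    provenance = []
    for step in steps:
        result = step(record)
        provenance.append(
            {"step": step.__name__, "input": digest(record), "output": digest(result)}
        )
        record = result
    return record, provenance


raw = {"patient_id": "12345", "hb": 13.2}
clean, log = run_pipeline(raw, [pseudonymize, harmonize])
```

Because each log entry stores the fingerprint of what went in and what came out, the output of one step can be matched to the input of the next, which is one simple way to make a processing chain traceable.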

Context. Most research on grammatical and stylistic error correction focuses on English-language text. Thanks to the availability of large data sets, the accuracy of English grammar correction has increased significantly. Unfortunately, there are few studies on other languages. Systems for English are constantly developing and currently rely on machine learning methods: classification (sequence tagging) and machine translation. A large amount of parallel or manually labelled data is required to build a high-quality machine learning model for correcting grammatical and stylistic errors in texts written in morphologically complex languages. Manual data annotation requires a lot of effort by professional linguists, which makes the creation of text corpora, especially for morphologically rich languages such as Ukrainian, a time- and resource-consuming process. Objective. The study aims to develop a technology for correcting errors in Ukrainian-language texts based on machine learning methods using a small set of annotated parallel data. Method. Machine learning algorithms were selected to build a system for correcting errors in Ukrainian-language texts around an optimal pipeline that includes pre-processing, selection of text content, and feature generation on small annotated corpora. Using a neural network with a new architecture, reviewing state-of-the-art methods, and comparing different pipeline stages make it possible to determine a combination that yields a high-quality error correction model for Ukrainian-language texts. Results. A machine learning model for error correction in Ukrainian-language texts has been developed. A universal scheme for creating an error correction system for different languages is proposed. According to the results, the neural network can correct simple sentences written in Ukrainian. However, creating a full-fledged system will require spell-checking using dictionaries and rule checking, with rules that are either simple or based on dependency parsing results or other features. The pre-trained neural translation model mT5 performs best among the three models. To save computing resources, it is also possible to use a pre-trained BERT-type neural network as both the encoder and the decoder. Such a neural network has half as many parameters as the other pre-trained machine translation models and shows satisfactory results in correcting grammatical and stylistic errors. Conclusions. The created model shows excellent classification results on test data. The calculated machine translation quality metrics allow only a partial comparison of the models, since most of the words and phrases in the original and corrected sentences are the same. The best values for both BLEU (0.908) and METEOR (0.956) are obtained for mT5, which is consistent with the case study, in which this network produces the most accurate error corrections without changing the original meaning of the sentence. M2M100 has a higher BLEU score (0.847) than the "Ukrainian Roberta" Encoder-Decoder (0.697). However, a subjective evaluation of the corrected examples shows that M2M100 does a much worse job than the other two models. For METEOR, M2M100 (0.925) also scores higher than the "Ukrainian Roberta" Encoder-Decoder (0.876).
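
Why overlap-based metrics allow only a partial comparison can be seen in a small Python sketch. It computes clipped n-gram precision, one building block of BLEU, and is not the evaluation code used in the study: full BLEU also combines several n-gram orders and applies a brevity penalty, and the English sentences below are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "the cat sits on the mat"    # made-up correct sentence
uncorrected = "the cat sit on the mat"   # same sentence with the error left in

unigram = clipped_precision(uncorrected, reference, 1)  # 5 of 6 unigrams match
bigram = clipped_precision(uncorrected, reference, 2)   # 3 of 5 bigrams match
```

A system that returns the input unchanged still matches five of six words here, so a high overlap score alone does not show that the error was actually corrected.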

Articles published on Manual Data Annotation

Corn kernel classification from few training samples

AARDVARK: an automated reversion detector for variants affecting resistance kinetics.

Highly Flexible Deep-Learning-Based Automatic Analysis for Graphically Encoded Hydrogel Microparticles.

Efficient unsupervised learning of biological images with compressed deep features

Analysis of contextualized intensity in Men’s elite handball using graph-based deep learning

ASPER: Answer Set Programming Enhanced Neural Network Models for Joint Entity-Relation Extraction

SRL-ACO: A text augmentation framework based on semantic role labeling and ant colony optimization

TomoTwin: generalized 3D localization of macromolecules in cryo-electron tomograms with structural data mining

FAIRness through automation: development of an automated medical data integration infrastructure for FAIR health data in a maximum care university hospital

Novel tools for neuronal activity, vessel diameter and red blood cell velocity imaging

Vader Lexicon and Support Vector Machine Algorithm to Detect Customer Sentiment Orientation

An efficient memory reserving-and-fading strategy for vector quantization based 3D brain segmentation and tumor extraction using an unsupervised deep learning network.

Self-supervised maize kernel classification and segmentation for embryo identification.

Accuracy of Manual Intracranial Pressure Recording Compared to a Computerized High-Resolution System: A CENTER-TBI Analysis

Research on the construction of event logic knowledge graph of supply chain management

Bayesian detection and tracking of odontocetes in 3D from their echolocation clicks

Efficient Deep Reinforcement Learning-Enabled Recommendation

Deep Learning Based Text Classification Methods

TECHNOLOGY FOR GRAMMATICAL ERRORS CORRECTION IN UKRAINIAN TEXT CONTENT BASED ON MACHINE LEARNING METHODS

Synthetic Datasets for Rebar Instance Segmentation Using Mask R-CNN
