High-quality Datasets Research Articles

This paper presents AutoTemplate, an innovative data preprocessing protocol, addressing the crucial need for high-quality chemical reaction datasets in the realm of machine learning applications in organic chemistry. Recent advances in artificial intelligence have expanded the application of machine learning in chemistry, particularly in yield prediction, retrosynthesis, and reaction condition prediction. However, the effectiveness of these models hinges on the integrity of chemical reaction datasets, which are often plagued by inconsistencies like missing reactants, incorrect atom mappings, and outright erroneous reactions. AutoTemplate introduces a two-stage approach to refine these datasets. The first stage involves extracting meaningful reaction transformation rules and formulating generic reaction templates using a simplified SMARTS representation. This simplification broadens the applicability of templates across various chemical reactions. The second stage is template-guided reaction curation, where these templates are systematically applied to validate and correct the reaction data. This process effectively amends missing reactant information, rectifies atom-mapping errors, and eliminates incorrect data entries. A standout feature of AutoTemplate is its capability to concurrently identify and correct false chemical reactions. It operates on the premise that most reactions in datasets are accurate, using these as templates to guide the correction of flawed entries. The protocol demonstrates its efficacy across a range of chemical reactions, significantly enhancing dataset quality. This advancement provides a more robust foundation for developing reliable machine learning models in chemistry, thereby improving the accuracy of forward and retrosynthetic predictions. 
AutoTemplate marks a significant progression in the preprocessing of chemical reaction datasets, bridging a vital gap and facilitating more precise and efficient machine learning applications in organic synthesis.

Scientific contribution: The proposed automated preprocessing tool for chemical reaction data aims to identify errors within chemical databases. Specifically, if the errors involve atom mapping or the absence of reactant types, corrections can be systematically applied using reaction templates, ultimately elevating the overall quality of the database.
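The two-stage idea described in this abstract (derive templates from the reactions that recur, then use them to check and repair the rest) can be illustrated with a small toy sketch. This is not the AutoTemplate implementation: the real protocol works on atom-mapped reactions and simplified SMARTS templates, whereas the records, functional-group labels, the `min_support` threshold, and the function names below are all hypothetical stand-ins chosen so the example runs with the standard library alone.

```python
from collections import Counter

# Toy records (hypothetical): each reaction is reduced to the functional-group
# classes of its reactants and of its product. A real pipeline would derive
# templates from atom-mapped reaction SMILES instead.
reactions = [
    {"id": 1, "reactants": {"carboxylic_acid", "amine"}, "product": "amide"},
    {"id": 2, "reactants": {"carboxylic_acid", "amine"}, "product": "amide"},
    {"id": 3, "reactants": {"carboxylic_acid", "amine"}, "product": "amide"},
    {"id": 4, "reactants": {"amine"}, "product": "amide"},   # missing reactant
    {"id": 5, "reactants": {"alkane"}, "product": "amide"},  # implausible entry
]


def extract_templates(records, min_support=2):
    """Stage 1: keep (reactants, product) patterns seen at least min_support times.

    This encodes the premise that most entries are correct, so frequently
    recurring patterns are treated as trusted templates.
    """
    counts = Counter((frozenset(r["reactants"]), r["product"]) for r in records)
    return {pattern for pattern, n in counts.items() if n >= min_support}


def curate(records, templates):
    """Stage 2: validate each record against the trusted templates.

    Returns ids that match a template, (id, missing reactants) pairs for
    records that a single template can complete, and ids that match nothing.
    """
    kept, repaired, rejected = [], [], []
    for r in records:
        have = frozenset(r["reactants"])
        if (have, r["product"]) in templates:
            kept.append(r["id"])
            continue
        # Templates with the same product whose reactants strictly contain ours.
        candidates = [t for t in templates if t[1] == r["product"] and have < t[0]]
        if len(candidates) == 1:
            repaired.append((r["id"], sorted(candidates[0][0] - have)))
        else:
            rejected.append(r["id"])
    return kept, repaired, rejected


templates = extract_templates(reactions)
kept, repaired, rejected = curate(reactions, templates)
print("kept:", kept)
print("repaired:", repaired)
print("rejected:", rejected)
```

In this sketch, a record is repaired only when exactly one trusted template can complete it; ambiguous or unmatched records are set aside rather than guessed at, which is a design choice of the example and not a claim about the published method.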

e13627

Background: The analysis of genomic variants is crucial in precision oncology research, offering insights into cancer risks and progression, especially in diverse types such as lung adenocarcinoma (LUAD). However, such research often grapples with balancing patient privacy with the need for comprehensive, high-quality genomic datasets. Our project addresses this by creating synthetic clinical-genomic data, which maintains patient confidentiality and provides a rich resource for genomic cancer research.

Methods: Leveraging the GuardantINFORM database, which includes anonymized genomic data and structured payer claims, we focused on generating synthetic data for LUAD patient cohorts. This approach involves processing real patient data into a format compatible with Medisyn’s generative AI models, ensuring the synthetic data retains the original's statistical properties, and processing the output back into the original database structure and format. This method plays a crucial role in maintaining patient privacy and serves as a valuable tool for research by enabling the generation of realistic patients with desired properties on demand.

Results: Our synthetic data closely mirrors real-world genomic and claims variable distributions, evidenced by a 0.994 R² correlation between real and synthetic data along with comparable Oncoprints. Importantly, privacy tests show that patient confidentiality is effectively maintained despite this effective performance. The synthetic data's utility was then demonstrated in a study replicating real-world findings: LUAD patients with KRAS G12C in combination with STK11 mutations showed a significantly higher risk of early mortality. This underscores the potential of synthetic data in advancing cancer research.

Conclusions: This research offers a promising avenue for the cancer research community. By providing a method to share privatized, synthetic genomic data, which can be combined and generated on demand, we enable broader, more responsible data sharing. This approach protects patient privacy and offers a rich dataset for groundbreaking research, potentially accelerating advances in cancer diagnosis and treatment. [Table: see text]
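The abstract does not say how its real-versus-synthetic R² was computed. One common reading is the squared Pearson correlation between matched summary statistics, for example the prevalence of each variable in the real cohort against its prevalence in the synthetic cohort. The sketch below assumes that reading; the function name and the prevalence values are hypothetical illustrations, not data from the study.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical per-variable prevalences (fraction of patients), listed in the
# same variable order for the real and the synthetic cohort.
real_prevalence = [0.30, 0.15, 0.45, 0.10]
synthetic_prevalence = [0.29, 0.16, 0.44, 0.11]

score = r_squared(real_prevalence, synthetic_prevalence)
print(f"R^2 between real and synthetic prevalences: {score:.3f}")
```

A value close to 1 under this definition says the synthetic cohort reproduces the relative frequencies of the variables; it does not by itself show that joint relationships between variables are preserved.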

Articles published on High-quality Datasets

MaDroid: A maliciousness-aware multifeatured dataset for detecting android malware

AutoTemplate: enhancing chemical reaction datasets for machine learning applications in organic chemistry

Accurate prediction of CDR-H3 loop structures of antibodies with deep learning

Misogynistic attitude detection in YouTube comments and replies: A high-quality dataset and algorithmic models

Pulse wave signal-driven machine learning for identifying left ventricular enlargement in heart failure patients

Broiler health monitoring technology based on sound features and random forest

Effect of Using Numerical Data Scaling on Supervised Machine Learning Performance

BCN20000: Dermoscopic Lesions in the Wild

Digital-twin-driven intelligent tracking error compensation of ultra-precision machining

A high-quality dataset featuring classified and annotated cervical spine X-ray atlas

De novo transcriptomes of cave and surface isopod crustaceans: insights from 11 species across three suborders

Revisiting Bundle Recommendation for Intent-aware Product Bundling

Meta-Fed IDS: Meta-learning and Federated learning based fog-cloud approach to detect known and zero-day cyber attacks in IoMT networks

Benchmarking compound activity prediction for real-world drug discovery applications

Prescriptive procedure for manual code smell annotation

Feature Selection Techniques in Intrusion Detection: A Comprehensive Review

Harnessing AI for solar energy: Emergence of transformer models

ParisLuco3D: A High-Quality Target Dataset for Domain Generalization of LiDAR Perception

AI-generated synthetic clinical-genomic data for precision oncology research: Validation using a case study on lung adenocarcinoma.
