Synthetic Datasets Research Articles

Uncertainty of data, the degree to which data are inaccurate, imprecise, untrusted, and undetermined, is inherent in many contemporary database applications, and numerous research endeavours have been devoted to efficiently answer skyline queries over uncertain data. The literature discussed two different methods that could be used to handle the data uncertainty in which objects having continuous range values. The first method employs a probability-based approach, while the second assumes that the uncertain values are represented by their median values. Nevertheless, neither of these methods seem to be suitable for the modern high-dimensional uncertain databases due to the following reasons. The first method requires an intensive probability calculations while the second is impractical. Therefore, this work introduces an index, non-probability framework named Constrained Skyline Query processing on Uncertain Data (CSQUiD) aiming at reducing the computational time in processing constrained skyline queries over uncertain high-dimensional data. Given a collection of objects with uncertain data, the CSQUiD framework constructs the minimum bounding rectangles (MBRs) by employing the X-tree indexing structure. Instead of scanning the whole collection of objects, only objects within the dominant MBRs are analyzed in determining the final skylines. In addition, CSQUiD makes use of the Fuzzification approach where the exact value of each continuous range value of those dominant MBRs’ objects is identified. The proposed CSQUiD framework is validated using real and synthetic data sets through extensive experimentations. Based on the performance analysis conducted, by varying the sizes of the constrained query, the CSQUiD framework outperformed the most recent methods (CIS algorithm and SkyQUD-T framework) with an average improvement of 44.07% and 57.15% with regards to the number of pairwise comparisons, while the average improvement of CPU processing time over CIS and SkyQUD-T stood at 27.17% and 18.62%, respectively.

We aimed to develop a framework for generating synthetic identifier datasets to support development and evaluation of data linkage methods. We evaluated whether replicating associations between attributes and identifiers improved the utility of the synthetic data for assessing linkage error. We determined the steps required to generate synthetic identifiers that replicate the properties of real-world data collection. We generated synthetic versions of a large United Kingdom cohort study (the Avon Longitudinal Study of Parents and Children), according to the quality and completeness of identifiers recorded over several waves of the cohort. We evaluated the utility of the synthetic identifier data in terms of assessing linkage quality (false matches and missed matches). Comparing data from two collection points in ALSPAC, we found within-person disagreement in identifiers (differences in recording due to both natural change and non-valid entries) in 18% of surnames and 12% of forenames. Rates of disagreement varied by maternal age and ethnic group. Synthetic data provided accurate estimates of linkage quality metrics compared with the original data (within 0.13-0.55% for missed matches and 0.00-0.04% for false matches). Incorporating associations between identifier errors and maternal age/ethnicity improved synthetic data utility. Replicating dependencies between attribute values (e.g. ethnicity), values of identifiers (e.g. name), identifier disagreements (e.g. missing values, errors or changes over time), and their patterns and distribution structure enables generation of realistic synthetic data that can be used for robust evaluation of linkage methods. Our framework provides a novel and generalisable mechanism for developing and benchmarking record linkage algorithms.

Synthetic Datasets Research Articles

Related Topics

Articles published on Synthetic Datasets

When guided diffusion model meets zero-shot image super-resolution

EmbedGEM: A framework to evaluate the utility of embeddings for genetic discovery

CSQUiD: an index and non-probability framework for constrained skyline query processing over uncertain data

Embedded U-shaped network with cross-hierarchical feature adaptation fusion for remote sensing image haze removal

An Expectation-Maximization framework for Personalized Itinerary Recommendation with POI Categories and Must-see POIs

On the development and validation of large language model-based classifiers for identifying social determinants of health

RADIAN – A tool for generating synthetic spatial data for use in teaching and learning

Joint consensus kernel learning and adaptive hypergraph regularization for graph-based clustering

ES-GNN: Generalizing Graph Neural Networks Beyond Homophily With Edge Splitting.

Soil water content estimation by using ground penetrating radar data full waveform inversion with grey wolf optimizer algorithm

Target-Specific De Novo Peptide Binder Design with DiffPepBuilder.

Efficient Large-scale Nonstationary Spatial Covariance Function Estimation Using Convolutional Neural Networks

Bayesian seismic 4D inversion for lithology and fluid prediction

Multidimensional time series motif group discovery based on matrix profile

Co-Clustering by Directly Solving Bipartite Spectral Graph Partitioning.

BayesianSSA: a Bayesian statistical model based on structural sensitivity analysis for predicting responses to enzyme perturbations in metabolic networks

Generating synthetic identifiers to support development and evaluation of data linkage methods

Reservoir oriented migrated seismic image improvement and poststack seismic inversion using well-seismic mistie

Data‐driven simulation of functional fatigue in shape memory alloy wires

Exploring methods for generating synthetic data in Scotland to improve access to public sector data for research

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Synthetic Datasets Research Articles

Related Topics

Articles published on Synthetic Datasets

When guided diffusion model meets zero-shot image super-resolution

EmbedGEM: A framework to evaluate the utility of embeddings for genetic discovery

CSQUiD: an index and non-probability framework for constrained skyline query processing over uncertain data

Embedded U-shaped network with cross-hierarchical feature adaptation fusion for remote sensing image haze removal

An Expectation-Maximization framework for Personalized Itinerary Recommendation with POI Categories and Must-see POIs

On the development and validation of large language model-based classifiers for identifying social determinants of health

RADIAN – A tool for generating synthetic spatial data for use in teaching and learning

Joint consensus kernel learning and adaptive hypergraph regularization for graph-based clustering

ES-GNN: Generalizing Graph Neural Networks Beyond Homophily With Edge Splitting.

Soil water content estimation by using ground penetrating radar data full waveform inversion with grey wolf optimizer algorithm

Target-Specific De Novo Peptide Binder Design with DiffPepBuilder.

Efficient Large-scale Nonstationary Spatial Covariance Function Estimation Using Convolutional Neural Networks

Bayesian seismic 4D inversion for lithology and fluid prediction

Multidimensional time series motif group discovery based on matrix profile

Co-Clustering by Directly Solving Bipartite Spectral Graph Partitioning.

BayesianSSA: a Bayesian statistical model based on structural sensitivity analysis for predicting responses to enzyme perturbations in metabolic networks

Generating synthetic identifiers to support development and evaluation of data linkage methods

Reservoir oriented migrated seismic image improvement and poststack seismic inversion using well-seismic mistie

Data‐driven simulation of functional fatigue in shape memory alloy wires

Exploring methods for generating synthetic data in Scotland to improve access to public sector data for research