Fully automated synthetic BIM dataset generation using a deep learning-based framework
31
- 10.1016/j.autcon.2023.104937
- May 23, 2023
- Automation in Construction
96
- 10.1016/j.jclepro.2020.125622
- Dec 31, 2020
- Journal of Cleaner Production
93
- 10.1007/978-3-030-27477-1_30
- Aug 3, 2019
43
- 10.3390/buildings12060830
- Jun 14, 2022
- Buildings
20
- 10.3390/su12176713
- Aug 19, 2020
- Sustainability
8
- 10.1016/j.autcon.2023.105132
- Oct 20, 2023
- Automation in Construction
3
- 10.1109/access.2024.3451406
- Jan 1, 2024
- IEEE Access
181
- 10.1016/j.autcon.2011.05.023
- Jun 21, 2011
- Automation in Construction
22
- 10.1016/j.autcon.2023.105156
- Nov 1, 2023
- Automation in Construction
- Retracted
3
- 10.1371/journal.pone.0187513
- Nov 17, 2017
- PLoS ONE
- Research Article
23
- 10.3390/cancers14184457
- Sep 14, 2022
- Cancers
Simple Summary: Gliomas comprise 80% of all malignant brain tumors. We aimed to develop a deep learning-based framework for the automatic segmentation and characterization of gliomas. In this retrospective study, patients were included if they: (1) had a diagnosis of glioma confirmed by histopathology and (2) had preoperative MRI with the inclusion of FLAIR imaging. The deep learning-based U-Net framework was developed based on manual segmentation on FLAIR as the ground truth mask for automatic segmentation and feature extraction, which were used for the prediction of biomarker status and prognosis. A total of 208 patients were included from our internal dataset, with stratified sampling to split the database into training and validation. An external dataset (n = 31) from an outside institution was used for testing. The Dice similarity coefficient of the generated mask was 0.93 on the testing dataset. The prediction of the radiomic model achieved an AUC of 0.88 for IDH-1 and 0.62 for MGMT on the testing dataset. Our deep learning-based framework can detect and segment gliomas with excellent performance for the prediction of IDH-1 biomarker status and survival. (1) Background: Gliomas are the most common primary brain neoplasms, accounting for roughly 40–50% of all malignant primary central nervous system tumors. We aim to develop a deep learning-based framework for automated segmentation and prediction of biomarkers and prognosis in patients with gliomas. (2) Methods: In this retrospective two-center study, patients were included if they (1) had a diagnosis of glioma with known surgical histopathology and (2) had preoperative MRI with a FLAIR sequence. The entire tumor volume, including the FLAIR-hyperintense infiltrative component and the necrotic and cystic components, was segmented. A deep learning-based U-Net framework with a symmetric architecture was developed using the 512 × 512 segmented maps from FLAIR as the ground truth mask.
(3) Results: The final cohort consisted of 208 patients with a mean ± standard deviation age of 56 ± 15 years and an M/F ratio of 130/78. The DSC of the generated mask was 0.93. Prediction of IDH-1 and MGMT status achieved AUCs of 0.88 and 0.62, respectively. Prediction of survival of <18 months demonstrated an AUC of 0.75. (4) Conclusions: Our deep learning-based framework can detect and segment gliomas with excellent performance for the prediction of IDH-1 biomarker status and survival.
- Research Article
1
- 10.5731/pdajpst.2021.012659
- Jan 1, 2022
- PDA Journal of Pharmaceutical Science and Technology
Application of synthetic datasets in training and validation of analysis tools has led to improvements in many decision-making tasks in a range of domains from computer vision to digital pathology. Synthetic datasets overcome the constraints of real-world datasets, namely difficulties in collection and labeling, expense, time, and privacy concerns. In flow cytometry, real cell-based datasets are limited by properties such as size, number of parameters, distance between cell populations, and distributions, and are often focused on a narrow range of disease or cell types. Researchers in some cases have designed these desired properties into synthetic datasets; however, they have been implemented inconsistently, and there is a scarcity of publicly available, high-quality synthetic datasets. In this research, we propose a method to systematically design and generate flow cytometry synthetic datasets with highly controlled characteristics. We demonstrate the generation of two-cluster synthetic datasets with specific degrees of separation between cell populations, and of non-normal distributions with increasing levels of skewness and orientations of skew pairs. We apply our synthetic datasets to test the performance of a popular automated cell population identification software, SPADE3, and define the region where the software performance decreases as the clusters get closer together. Application of the synthetic skewed dataset suggests the software is capable of processing non-normal data. We calculate the classification accuracy of SPADE3 with a robustness not achievable with real-world datasets. Our approach aims to advance research toward the generation of high-quality synthetic flow cytometry datasets and to increase awareness of them among the community. The synthetic datasets can be used in benchmarking studies that critically evaluate cell population identification tools and help illustrate potential digital platform inconsistencies.
These datasets have the potential to improve cell characterization workflows that integrate automated analysis in clinical diagnostics and cell therapy manufacturing.
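The controlled-separation idea described above can be illustrated with a minimal, generic sketch (not the paper's method; cluster sizes, dimensionality, and the separation value are illustrative assumptions): two unit-variance Gaussian clusters whose means are placed a chosen distance apart, so downstream tools can be tested at known degrees of overlap.

```python
import numpy as np

def two_cluster_dataset(n_per_cluster, dims, separation, rng):
    """Two unit-variance Gaussian clusters whose means lie `separation`
    apart along the first axis; labels identify the cluster of origin."""
    a = rng.normal(0.0, 1.0, size=(n_per_cluster, dims))
    b = rng.normal(0.0, 1.0, size=(n_per_cluster, dims))
    b[:, 0] += separation  # shift the second cluster by the chosen amount
    X = np.vstack([a, b])
    y = np.repeat([0, 1], n_per_cluster)
    return X, y

rng = np.random.default_rng(42)
X, y = two_cluster_dataset(500, 4, separation=6.0, rng=rng)
gap = X[y == 1, 0].mean() - X[y == 0, 0].mean()  # realized separation
```

Sweeping `separation` toward zero is one way to map out where a clustering tool starts to merge the two populations.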
- Conference Article
127
- 10.1109/smartgridcomm.2018.8587464
- Oct 1, 2018
The availability of fine-grained time series data is a prerequisite for research in smart grids. While data for transmission systems is relatively easily obtainable, issues related to data collection, security and privacy hinder the widespread public availability/accessibility of such datasets at the distribution system level. This has prevented the larger research community from effectively applying sophisticated machine learning algorithms to significantly improve the distribution-level accuracy of predictions and increase the efficiency of grid operations. Synthetic dataset generation has proven to be a promising solution for addressing data availability issues in various domains such as computer vision, natural language processing and medicine. However, its exploration in the smart grid context remains unsatisfactory. Previous works have tried to generate synthetic datasets by modeling the underlying system dynamics: an approach which is difficult, time consuming, error prone and often infeasible. In this work, we propose a novel data-driven approach to synthetic dataset generation by utilizing deep generative adversarial networks (GAN) to learn the conditional probability distribution of essential features in the real dataset and generate samples based on the learned distribution. To evaluate our synthetically generated dataset, we measure the maximum mean discrepancy (MMD) between real and synthetic datasets as probability distributions, and show that their sampling distance converges. To further validate our synthetic dataset, we perform common smart grid tasks such as k-means clustering and short-term prediction on both datasets. Experimental results show the efficacy of our synthetic dataset approach: the real and synthetic datasets are indistinguishable by solely examining the output of these tasks.
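The MMD evaluation mentioned above can be sketched with a standard biased RBF-kernel estimator (a generic illustration, not the paper's implementation; the sample arrays, dimensionality, and kernel bandwidth are assumptions): a faithful synthetic sample should score a much smaller MMD against the real data than a mismatched one.

```python
import numpy as np

def mmd_rbf(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X and Y (RBF kernel)."""
    def k(A, B):
        # Pairwise squared Euclidean distances, then the Gaussian kernel.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 3))       # stand-in for real features
synthetic = rng.normal(0.0, 1.0, size=(200, 3))  # well-matched synthetic sample
shifted = rng.normal(3.0, 1.0, size=(200, 3))    # poorly-matched synthetic sample

mmd_good = mmd_rbf(real, synthetic)
mmd_bad = mmd_rbf(real, shifted)
```

As the generator's output distribution converges to the real one, this statistic shrinks toward the small positive bias of the estimator.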
- Research Article
9
- 10.1016/j.isprsjprs.2022.04.004
- Apr 26, 2022
- ISPRS Journal of Photogrammetry and Remote Sensing
Deep-learning generation of POI data with scene images
- Research Article
- 10.3390/app15073490
- Mar 22, 2025
- Applied Sciences
This study examines the role of linguistic regional variations in synthetic dataset generation and their impact on emotion detection performance. Emotion detection is essential for natural language processing (NLP) applications such as social media analysis, customer service, and mental health monitoring. To explore this, synthetic datasets were generated using a state-of-the-art language model, incorporating English variations from the United States, United Kingdom, and India, alongside a general baseline dataset. Two levels of prompt specificity were employed to assess the influence of regional linguistic nuances. Statistical analyses—including frequency distribution, term frequency-inverse document frequency (TF-IDF), type–token ratio (TTR), hapax legomena, pointwise mutual information (PMI) scores, and key-phrase extraction—revealed significant linguistic diversity and regional distinctions in the generated datasets. To evaluate their effectiveness, classification experiments were conducted with two models using bidirectional encoder representations from transformers (BERT) and its de-noising sequence to sequence variation (BART), beginning with zero-shot classification on the contextualized affect representations for emotion recognition (CARER) dataset, followed by fine-tuning with both baseline and region-specific datasets. Results demonstrated that region-specific datasets, particularly those generated with detailed prompts, significantly improved classification accuracy compared to the baseline. These findings underscore the importance of incorporating global linguistic variations in synthetic dataset generation, offering insights into how regional adaptations can enhance emotion detection models for diverse NLP applications.
- Conference Article
19
- 10.1109/icpr.2008.4761770
- Dec 1, 2008
Usually, performance of classifiers is evaluated on real-world problems that mainly belong to public repositories. However, we ignore the inherent properties of these data and how they affect classifier behavior. Also, the high cost or the difficulty of experiments hinder the data collection, leading to complex data sets characterized by few instances, missing values, and imprecise data. The generation of synthetic data sets solves both issues and allows us to build problems with a minor cost and whose characteristics are predefined. This is useful to test system limitations in a controlled framework. This paper proposes to generate synthetic data sets based on data complexity. We rely on the length of the class boundary to build the data sets, obtaining a preliminary set of benchmarks to assess classifier accuracy. The study can be further matured to identify regions of competence for classifiers.
- Research Article
- 10.1049/cim2.70039
- Jan 1, 2025
- IET Collaborative Intelligent Manufacturing
Industrial manufacturing faces many challenges and opportunities as novel technologies change how products are designed and produced. The design step of a product requires skills and time, starting from conceptualising the object's 3D shape. However, AI models have been proven capable of reconstructing 3D models from images. Thus, a designer may approach the modelling phase of a product with traditional CAD software, relying not only on existing 3D models but also on the digitalisation of everyday real objects, prototypes, or photographs. However, AI models need to be trained on extensive datasets to obtain reliable behaviours, and the manual creation of such datasets is usually time-consuming. Synthetic datasets could speed up the model's training process by providing automatically labelled data for the objects of interest to the designer. This research explores a novel approach to foster synthetic dataset generation for 3D object reconstruction. The proposed pipeline involves setting up 3D models and customising the rendering pipeline to automatically create datasets with different rendering properties. These datasets are then used to train and test a 3D object reconstruction model to investigate how to improve synthetic dataset generation to optimise performance.
- Research Article
- 10.3390/s25092825
- Apr 30, 2025
- Sensors (Basel, Switzerland)
In the field of Cyber Threat Intelligence (CTI), the scarcity of high-quality, labelled datasets that include Indicators of Compromise (IoCs) impacts the design and implementation of robust predictive models capable of classifying IoCs in online communication, specifically in social media contexts where users are potentially highly exposed to cyber threats. Thus, the generation of high-quality synthetic datasets can be utilized to fill this gap and develop effective CTI systems. Therefore, this study aims to fine-tune OpenAI's Large Language Model (LLM), Gpt-3.5, to generate a synthetic dataset that replicates the style of a real social media curated dataset, as well as incorporates select IoCs as domain knowledge. Four machine-learning (ML) and deep-learning (DL) models were evaluated on two generated datasets (one with 4000 instances and the other with 12,000). The results indicated that, on the 4000-instance dataset, the Dense Neural Network (DenseNN) achieved the highest accuracy (77%), while on the 12,000-instance dataset, Logistic Regression (LR) achieved the highest accuracy of 82%. This study highlights the potential of integrating fine-tuned LLMs with domain-specific knowledge to create high-quality synthetic data. The main contribution of this research is the adoption of fine-tuning of an LLM, Gpt-3.5, using real social media datasets and curated IoC domain knowledge, which is expected to improve the process of synthetic dataset generation and later IoC extraction and classification, offering a realistic and novel resource for cybersecurity applications.
- Conference Article
1
- 10.2118/207266-ms
- Dec 9, 2021
In the O&G (Oil & Gas) industry, unstructured data sources such as technical reports on hydrocarbon production, daily drilling, well construction, etc. contain valuable information. This information, however, is conveyed through various formats such as tables, forms, text, figures, etc. Detecting these different entities in documents is essential for building a structured representation of the information within and for automated processing of documents at scale. Our work presents a document layout analysis workflow to detect/localize different entities based on a deep learning-based framework. The workflow comprises a deep learning-based object-detection framework based on transformers to identify the spatial location of entities in a document page. The key elements of the object-detection pipeline include a residual network backbone for feature extraction and an encoder-decoder transformer based on the latest detection transformers (DETR) to predict object-bounding boxes and category labels. The object detection is formulated as a direct set prediction task using bipartite matching while also eliminating conventional operations like anchor box generation and non-maximal suppression. The availability of sufficient publicly available document layout data sets that incorporate the artifacts observed in historical O&G technical reports is often a major challenge. We attempt to address this challenge by using a novel training data augmentation methodology. The dense occurrence of elements in a page can often introduce uncertainties resulting in bounding boxes cutting through text content. We adopt a bounding box post-processing methodology to refine the bounding box coordinates to minimize undercuts. The proposed document layout analysis pipeline was trained to detect entity types such as headings, text blocks, tables, forms, and images/charts in a document page.
A wide range of pages from lithology, stratigraphy, drilling, and field development reports were used for model training. The reports also included a considerable number of historical scanned reports. The trained object-detection model was evaluated on a test data set prepared from the O&G reports. DETR demonstrated superior performance when compared with the Mask R-CNN on our dataset.
- Research Article
7
- 10.1007/s13755-023-00241-y
- Aug 30, 2023
- Health Information Science and Systems
Purpose: The purpose of this study is to construct a synthetic dataset of ECG signals that overcomes the sensitivity of personal information and the complexity of disclosure policies. Methods: The public dataset was constructed by generating synthetic data based on a deep learning model using a convolutional neural network (CNN) and bi-directional long short-term memory (Bi-LSTM), and the effectiveness of the dataset was verified by developing classification models for ECG diagnoses. Results: The synthetic 12-lead ECG dataset generated consists of a total of 6000 ECGs, with normal and 5 abnormal groups. The synthetic ECG signal has a waveform pattern similar to the original ECG signal; the average RMSE between the two signals is 0.042 µV, and the average cosine similarity is 0.993. In addition, five classification models were developed to verify the effect of the synthetic dataset and showed performance similar to that of the model made with the actual dataset. In particular, even when the real dataset was applied as a test set to the classification model trained with the synthetic dataset, the classification performance of all models showed high accuracy (average accuracy 93.41%). Conclusion: The synthetic 12-lead ECG dataset was confirmed to perform similarly to the real-world 12-lead ECG in the classification model. This implies that a synthetic dataset can perform similarly to a real dataset in clinical research using AI. The synthetic dataset generation process in this study provides a way to overcome the medical data disclosure challenges constrained by privacy rights, encourages open data policies, and contributes significantly to promoting cardiovascular disease research.
- Research Article
4
- 10.1016/j.dib.2024.110445
- Apr 20, 2024
- Data in Brief
The residential sector's substantial electricity consumption, driven by heating demands during winter, necessitates optimal energy consumption strategies in the era of decarbonization. To address this challenge, this paper introduces a synthetic dataset specifically tailored to simulate energy consumption in residential apartment buildings. Focusing on the interplay of cold weather conditions and the effects of aging factors, the dataset comprehensively encompasses key variables, including indoor temperature, energy consumption, outdoor temperature, outdoor humidity, and solar radiation. It underscores the considerable impact of building aging on energy consumption patterns. The dataset's significance extends across various domains, particularly in the realms of energy forecasting and thermal modelling. It serves as a robust foundation for predicting future consumption patterns, optimizing resource allocation, and refining energy efficiency strategies. The inclusion of indoor temperature data facilitates an in-depth thermal modelling approach, shedding light on intricate relationships that influence building performance in cold climates. Beyond traditional methods, the dataset proves invaluable in nonlinear modelling and machine learning. It emerges as a key tool for algorithm training, enhancing forecast precision, and supporting well-informed decision-making. The introduction of a temporal dimension by accounting for aging factors allows for the exploration of evolving building components over time, a critical consideration for sustainable energy management and building maintenance strategies. The dataset was meticulously generated by creating geometry using SketchUp and conducting energy modelling and simulations via the OpenStudio platform, which integrates the EnergyPlus modelling engine to enhance accuracy.
In summary, this synthetic dataset provides valuable insights into energy consumption in residential buildings exposed to cold weather conditions and the influences of aging. Its multifaceted applications across forecasting, modelling, management, and planning underscore its potential to advance sustainable and efficient energy practices.
- Research Article
- 10.1609/aaai.v39i9.33027
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
High-quality, pixel-level annotated datasets are crucial for training deep learning models, while their creation is often labor-intensive, time-consuming, and costly. Generative diffusion models have then gained prominence for producing synthetic datasets, yet existing text-to-data methods struggle with generating complex scenes involving multiple objects and intricate spatial arrangements. To address these limitations, we introduce FlexDataset, a framework that pioneers the composition-to-data (C2D) paradigm. FlexDataset generates high-fidelity synthetic datasets with versatile annotations, tailored for tasks like salient object detection, depth estimation, and segmentation. Leveraging a meticulously designed composition-to-image (C2I) framework, it offers precise positional and categorical control. Our Versatile Annotation Generation (VAG) Plan A further enhances efficiency by exploiting rich latent representations through tuned perception decoders, reducing annotation time by nearly fivefold. FlexDataset allows unlimited generation of customized, multi-instance and multi-category (MIMC) annotated data. Extensive experiments show that FlexDataset sets a new standard in synthetic dataset generation across multiple datasets and tasks, including zero-shot and long-tail scenarios.
- Research Article
46
- 10.1109/tvcg.2011.237
- Dec 1, 2011
- IEEE Transactions on Visualization and Computer Graphics
Generation of synthetic datasets is a common practice in many research areas. Such data is often generated to meet specific needs or certain conditions that may not be easily found in the original, real data. The nature of the data varies according to the application area and includes text, graphs, social or weather data, among many others. The common process to create such synthetic datasets is to implement small scripts or programs, restricted to small problems or to a specific application. In this paper we propose a framework designed to generate high dimensional datasets. Users can interactively create and navigate through multi dimensional datasets using a suitable graphical user-interface. The data creation is driven by statistical distributions based on a few user-defined parameters. First, a grounding dataset is created according to given inputs, and then structures and trends are included in selected dimensions and orthogonal projection planes. Furthermore, our framework supports the creation of complex non-orthogonal trends and classified datasets. It can successfully be used to create synthetic datasets simulating important trends as multidimensional clusters, correlations and outliers.
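The ingredients the framework above describes, a grounding dataset drawn from user-defined distributions, trends injected into selected dimensions, and outliers, can be sketched generically in a few lines (this is not the paper's tool; all sizes, covariances, and counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
n, dims = 1000, 6

# Grounding dataset: independent standard-normal dimensions.
data = rng.normal(0.0, 1.0, size=(n, dims))

# Inject a trend: a strong correlation between dimensions 0 and 1.
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
data[:, :2] = rng.multivariate_normal([0.0, 0.0], cov, size=n)

# Sprinkle a handful of outliers into the remaining background dimensions.
outlier_idx = rng.choice(n, size=10, replace=False)
data[outlier_idx, 2:] += rng.normal(0.0, 8.0, size=(10, dims - 2))

r = np.corrcoef(data[:, 0], data[:, 1])[0, 1]  # realized correlation, near 0.8
```

Because every trend and outlier is planted deliberately, such data gives visualization and clustering methods a known ground truth to recover.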
- Research Article
21
- 10.1177/14604582221077000
- Apr 1, 2022
- Health Informatics Journal
Digital health applications can improve quality and effectiveness of healthcare, by offering a number of new tools to users, which are often considered a medical device. Assuring their safe operation requires, amongst others, clinical validation, needing large datasets to test them in realistic clinical scenarios. Access to datasets is challenging, due to patient privacy concerns. Development of synthetic datasets is seen as a potential alternative. The objective of the paper is the development of a method for the generation of realistic synthetic datasets, statistically equivalent to real clinical datasets, and demonstrate that the Generative Adversarial Network (GAN) based approach is fit for purpose. A generative adversarial network was implemented and trained, in a series of six experiments, using numerical and categorical variables, including ICD-9 and laboratory codes, from three clinically relevant datasets. A number of contextual steps provided the success criteria for the synthetic dataset. A synthetic dataset that exhibits very similar statistical characteristics with the real dataset was generated. Pairwise association of variables is very similar. A high degree of Jaccard similarity and a successful K-S test further support this. The proof of concept of generating realistic synthetic datasets was successful, with the approach showing promise for further work.
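One of the checks named above, the two-sample Kolmogorov-Smirnov test comparing a real and a synthetic marginal, can be sketched with a minimal empirical-CDF implementation (a generic illustration, not the paper's pipeline; the sample distributions are assumptions):

```python
import numpy as np

def ks_statistic(a, b):
    """Maximum vertical distance between the empirical CDFs of a and b."""
    grid = np.sort(np.concatenate([a, b]))  # the statistic is attained at sample points
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

rng = np.random.default_rng(3)
real = rng.normal(50.0, 10.0, size=1000)        # e.g. a numerical lab value
synthetic = rng.normal(50.0, 10.0, size=1000)   # well-matched generator output
mismatched = rng.normal(70.0, 10.0, size=1000)  # poorly-matched generator output

ks_good = ks_statistic(real, synthetic)
ks_bad = ks_statistic(real, mismatched)
```

A small statistic (relative to the critical value for the sample sizes) is what "passing the K-S test" amounts to for each numerical variable.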
- Research Article
19
- 10.1021/acs.analchem.6b02139
- Nov 4, 2016
- Analytical Chemistry
Spatial clustering is a powerful tool in mass spectrometry imaging (MSI) and has been demonstrated to be capable of differentiating tumor types, visualizing intratumor heterogeneity, and segmenting anatomical structures. Several clustering methods have been applied to mass spectrometry imaging data, but a principled comparison and evaluation of different clustering techniques presents a significant challenge. We propose that testing whether the data has a multivariate normal distribution within clusters can be used to evaluate the performance when using algorithms that assume normality in the data, such as k-means clustering. In cases where clustering has been performed using the cosine distance, conversion of the data to polar coordinates prior to normality testing should be performed to ensure normality is tested in the correct coordinate system. In addition to these evaluations of internal consistency, we demonstrate that the multivariate normal distribution can then be used as a basis for statistical modeling of MSI data. This allows the generation of synthetic MSI data sets with known ground truth, providing a means of external clustering evaluation. To demonstrate this, reference data from seven anatomical regions of an MSI image of a coronal section of mouse brain were modeled. From this, a set of synthetic data based on this model was generated. Results of r2 fitting of the chi-squared quantile-quantile plots on the seven anatomical regions confirmed that the data acquired from each spatial region was found to be closer to normally distributed in polar space than in Euclidean. Finally, principal component analysis was applied to a single data set that included synthetic and real data. No significant differences were found between the two data types, indicating the suitability of these methods for generating realistic synthetic data.
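The within-cluster normality idea above rests on a standard property: for multivariate normal data, squared Mahalanobis distances to the sample mean follow a chi-squared distribution with d degrees of freedom, which is what the chi-squared quantile-quantile fit exploits. A minimal numpy sketch (generic, not the paper's code; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 2000
X = rng.multivariate_normal(np.zeros(d), np.eye(d), size=n)  # "one cluster"

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
centered = X - mu
# Squared Mahalanobis distance of each point to the cluster center.
d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

mean_d2 = d2.mean()  # for normal data this is close to d (chi-squared mean)
```

Comparing the sorted `d2` values against chi-squared quantiles (the Q-Q plot in the abstract) then gives a goodness-of-fit measure; a poor fit flags clusters that violate the normality assumption of methods like k-means.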
- Research Article
- 10.1016/j.autcon.2025.106543
- Dec 1, 2025
- Automation in Construction
- Research Article
- 10.1016/j.autcon.2025.106562
- Dec 1, 2025
- Automation in Construction
- Research Article
- 10.1016/j.autcon.2025.106436
- Dec 1, 2025
- Automation in Construction
- Research Article
- 10.1016/j.autcon.2025.106538
- Dec 1, 2025
- Automation in Construction
- Research Article
- 10.1016/j.autcon.2025.106568
- Dec 1, 2025
- Automation in Construction
- Research Article
- 10.1016/j.autcon.2025.106551
- Dec 1, 2025
- Automation in Construction
- Research Article
- 10.1016/j.autcon.2025.106492
- Dec 1, 2025
- Automation in Construction
- Research Article
- 10.1016/j.autcon.2025.106535
- Dec 1, 2025
- Automation in Construction
- Research Article
- 10.1016/j.autcon.2025.106534
- Dec 1, 2025
- Automation in Construction
- Research Article
- 10.1016/j.autcon.2025.106565
- Dec 1, 2025
- Automation in Construction