  • New
  • Open Access
  • Research Article
  • 10.3390/data11020033
Synthetic and Encoded Database of Dengue, Zika, Chikungunya, and Influenza Derived from the Literature
  • Feb 6, 2026
  • Data
  • Elí Cruz-Parada + 16 more

This work presents a synthetic binary database of Dengue, Zika, Chikungunya, and Influenza constructed entirely from clinical information extracted from the scientific literature. Due to the limited availability and heterogeneity of clinical records in medical units—particularly for arboviral diseases—existing datasets are often insufficient for developing robust Machine Learning models. To address this limitation, an extensive search of PubMed and Google Scholar was conducted between February 2024 and May 2025, following strict selection criteria focused on diagnostic confirmation. The resulting dataset comprises 48,214 records and 67 standardized signs and symptoms, homogenized across all pathologies. Each record is fully binary, contains no missing values, and represents symptom presence or absence. The composition includes 22,379 Dengue records, 7,135 Zika records, 7,959 Chikungunya records, and 10,741 Influenza records. Symptom prevalence was analyzed, revealing consistency with patterns reported in epidemiological and clinical studies, supporting the dataset’s plausibility. This database enables statistical exploration and direct integration into Machine Learning pipelines without the need for imputation. It has been used in an in silico predictive study of arboviral diseases, employing Influenza as a negative control, and serves as a reproducible, literature-derived resource for computational modeling.
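As a quick consistency check on the figures quoted in this abstract, the per-disease record counts can be summed and converted to class shares. This is a minimal sketch: only the four counts and the 48,214 total come from the abstract, while the variable names and the rounding to one decimal are illustrative choices.

```python
# Record counts per disease, as stated in the abstract.
counts = {
    "Dengue": 22379,
    "Zika": 7135,
    "Chikungunya": 7959,
    "Influenza": 10741,
}

# The abstract reports 48,214 records in total.
total = sum(counts.values())

# Share of each class in percent (rounding to one decimal is an illustrative choice).
shares = {disease: round(100 * n / total, 1) for disease, n in counts.items()}

for disease, share in shares.items():
    print(f"{disease}: {counts[disease]} records ({share}%)")
print(f"Total: {total}")
```

The shares make the class imbalance explicit, which matters when the records are fed to a classifier.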

  • New
  • Open Access
  • Research Article
  • 10.3390/data11020031
An Integrated Environmental and Perceptual Dataset for Predicting Comfort in Smart Campuses During the Fall Semester
  • Feb 3, 2026
  • Data
  • Gianni Tumedei + 3 more

Indoor environmental comfort plays a central role in occupants’ well-being, learning outcomes, and productivity, especially in educational buildings characterized by high occupancy variability and diverse activities. This paper presents a real-world dataset collected at the Cesena Campus of the University of Bologna, aimed at supporting occupant-centric comfort analysis and prediction in classrooms and laboratories. The dataset integrates continuous environmental measurements, such as temperature, humidity, noise, air pressure, and CO₂ concentration, with subjective comfort feedback gathered from students during regular lectures. Data were collected using permanently installed ceiling sensors and additional control sensors placed near occupants, enabling both longitudinal monitoring and validation analyses. Furthermore, the dataset includes both repeated comfort perception reports and a one-time comfort definition phase capturing individual relevance weights for different comfort dimensions. By combining objective and subjective data in realistic academic settings, the dataset provides a valuable resource for developing, benchmarking, and validating data-driven models for smart campus applications, indoor comfort prediction, and human-centered building analytics.

  • New
  • Open Access
  • Research Article
  • 10.3390/data11020027
Multiplex Immunofluorescence and Histopathology Dataset of Cell Cycle–Related Proteins in Renal Cell Carcinoma
  • Feb 1, 2026
  • Data
  • Hazem Abdullah + 12 more

Clear-cell renal cell carcinoma (ccRCC) accounts for the majority of kidney cancer diagnoses and exhibits widely variable clinical behaviour. The dataset described here was generated to support the discovery of robust biomarkers of tumour cell-cycle arrest and to inform the risk-stratified management of ccRCC. We assembled four independent cohorts including 480 patients from the UK arm of the SORCE adjuvant trial, 300 patients from a surgically treated series in Korea, 120 patients from a retrospective Scottish cohort, and a paired primary–metastatic cohort comprising 62 patients. Formalin-fixed paraffin-embedded nephrectomy specimens were processed for routine hematoxylin and eosin (H&E) histology, and for multiplex immunofluorescence (mIF). The mIF panels detect the cyclin-dependent kinase inhibitor p21CDKN1a, the DNA replication licencing factor MCM2, endoglin/CD105, Lamin B1 and nuclear DNA (Hoechst). Whole-slide images (WSIs) were acquired at high resolution, and artificial-intelligence pipelines were used to segment nuclei, classify individual cells into arrested phenotypes, and calculate the fraction of cells. Accompanying metadata include demographics, tumour stage, grade, Leibovich score, treatment arm (sorafenib/placebo), relapse events, and disease-free survival. All images and derived tables are released under a CC0 licence via the BioImage Archive, ensuring unrestricted reuse. This multi-cohort dataset provides a rich resource for studying cell-cycle arrest and proliferation markers, training image-analysis algorithms, and developing prognostic signatures in RCC.

  • New
  • Open Access
  • Research Article
  • 10.3390/data11020030
Refined IDRiD: An Enhanced Dataset for Diabetic Retinopathy Segmentation with Expert-Validated Annotations and Comprehensive Anatomical Context
  • Feb 1, 2026
  • Data
  • Sakon Chankhachon + 3 more

The Indian Diabetic Retinopathy Image Dataset (IDRiD) has been widely adopted for DR lesion segmentation research. However, it contains annotation gaps for proliferative DR lesions and labeling errors that limit its utility for comprehensive automated screening systems. We present Refined IDRiD, an enhanced version that addresses these limitations through (1) expert ophthalmologist validation and correction of labeling errors in original annotations for four non-proliferative lesions (microaneurysms, hemorrhages, hard exudates, cotton-wool spots), (2) the addition of three critical proliferative DR lesion annotations (neovascularization, vitreous hemorrhage, intraretinal microvascular abnormalities), and (3) the integration of comprehensive anatomical context (optic disc, fovea, blood vessels, retinal region). A team of three ophthalmologists (one senior specialist with >10 years’ experience, two expert fundus image annotators) conducted systematic annotation refinement, achieving an inter-rater agreement F1-score of 0.9012. The enhanced dataset comprises 81 high-resolution fundus images with pixel-level annotations for seven DR lesion types and four anatomical structures. All images were cropped to the retinal region of interest and resized to 1024 × 1024 pixels, with annotations stored as unified grayscale masks containing 12 classes enabling efficient multi-task learning. Refined IDRiD enables training of comprehensive DR screening systems capable of detecting both non-proliferative and proliferative stages while reducing false positives through anatomical context awareness.
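The unified 12-class grayscale mask described in this abstract can be unpacked into one binary mask per class for multi-task training. The sketch below is a rough illustration only. It assumes that the mask stores consecutive integer class indices 0 to 11, and the function name `split_mask` and the tiny example mask are hypothetical. The actual pixel encoding and class order should be taken from the dataset documentation.

```python
def split_mask(mask, num_classes=12):
    """Return {class_id: binary mask} for a 2-D list of integer class indices.

    Assumption: class ids are consecutive integers 0..num_classes-1.
    """
    return {
        class_id: [[1 if pixel == class_id else 0 for pixel in row] for row in mask]
        for class_id in range(num_classes)
    }

# Hypothetical 3x3 mask; real masks in the dataset are 1024 x 1024.
tiny_mask = [
    [0, 0, 3],
    [0, 3, 3],
    [11, 0, 0],
]

layers = split_mask(tiny_mask)
print(layers[3])  # binary mask for the pixels labelled 3
```

Keeping a single index mask on disk and expanding it on load keeps storage small while still allowing each class to be trained as its own target.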

  • New
  • Open Access
  • Research Article
  • 10.3390/data11020028
Dual-Source Synthetic Uzbek Corpora for Sentiment Analysis and NER with Controlled Emoji Signals
  • Feb 1, 2026
  • Data
  • Bobur Saidov + 8 more

This data descriptor presents two fully synthetic corpora for sentiment analysis and named entity recognition (NER) in Uzbek. The first corpus contains 12,000 hybrid synthetic sentences generated from templates with lexical randomization, automatic insertion of named entities (PER/ORG/LOC), lexicon-based polarity scoring, and a controlled emoji distribution. The second corpus includes 3,000 “manual-style” sentences designed to resemble short, naturally structured messages. Although the manual-style subset was initially intended to be emoji-free, the released version includes a 39.6% emoji presence (sentences containing at least one emoji) to maintain comparability in emotional markers across corpora. Both corpora are released in CSV, XLSX, and JSONL formats and share a unified schema (id, text, sentiment, entities, entity_type, polarity_score, polarity_source, token_count, emojis, emoji_position, emoji_sentiment, conflict_flag, sentiment_from_polarity_score, split). The dataset is publicly available via Mendeley Data (DOI: 10.17632/y2d5pcyrzz.3).
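Because both corpora share one schema, a JSONL record can be checked against the field list before use. The following is a minimal sketch: the field names are copied from the abstract, but the helper `missing_fields`, the placeholder records, and the assumption that every field appears as a top-level JSON key are illustrative and not confirmed by the source.

```python
import json

# Field names copied from the unified schema listed in the abstract.
SCHEMA_FIELDS = (
    "id", "text", "sentiment", "entities", "entity_type",
    "polarity_score", "polarity_source", "token_count", "emojis",
    "emoji_position", "emoji_sentiment", "conflict_flag",
    "sentiment_from_polarity_score", "split",
)

def missing_fields(jsonl_line):
    """Return the schema fields absent from one JSONL record, in schema order."""
    record = json.loads(jsonl_line)
    return [name for name in SCHEMA_FIELDS if name not in record]

# Hypothetical records: all values are placeholders, not taken from the corpora.
complete = json.dumps({name: None for name in SCHEMA_FIELDS})
incomplete = json.dumps({"id": 1, "text": "placeholder"})

print(missing_fields(complete))
print(missing_fields(incomplete))
```

Running the same check over every line of a downloaded file gives an early warning if a release differs from the schema described here.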

  • New
  • Open Access
  • Research Article
  • 10.3390/data11020026
100 m Resolution Age-Stratified Population Grid Data for China Based on Township-Level in 2020
  • Feb 1, 2026
  • Data
  • Chen Liang + 13 more

China’s age structure is undergoing profound demographic shifts, making accurate spatial information on age-stratified populations essential for policy-making, resource allocation, and risk assessment. However, census data are primarily aggregated by administrative units, offering coarse spatial resolution that constrains their integration and application with other gridded datasets. Using township-level population counts for four age groups (0–14, 15–59, 60–64, and ≥65 years) from the 2020 Seventh National Population Census across 38,572 townships, we developed an age-stratified downscaling framework. This framework integrates a random forest model with age-filtered Points of Interest (POI) data and other multi-source geospatial covariates to generate a 100 m resolution age-stratified population density weighting layer. Through dasymetric mapping of the township-level data, we produced the township-based 100 m Age-Stratified Population Grid Data (Township-ASPOP). Since township-level data represent the finest publicly available spatial unit of demographic statistics in China, we further validated the accuracy of Township-ASPOP by generating County-based 100 m Age-Stratified Population Grid Data (County-ASPOP) through dasymetric mapping using county-level age-stratified population data. The results demonstrate that County-ASPOP achieves superior predictive accuracy, with R² values of 0.95, 0.95, 0.85, and 0.86, and Root Mean Square Error (RMSE) values of 1743, 6829, 900, and 2033 persons per township for the four age groups, respectively, significantly outperforming the contemporaneous WorldPop dataset (R² = 0.69, 0.72, 0.64, and 0.60). The accuracy of Township-ASPOP is no less than that of County-ASPOP and effectively captures realistic spatial settlement patterns. This study establishes a reproducible framework for generating age-stratified population grid data and provides critical data support for policy formulation and resource allocation.
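The two accuracy metrics quoted in this abstract can be computed from paired township totals as sketched below. The abstract does not state which definition of R² was used. This sketch assumes the coefficient of determination, 1 − SS_res/SS_tot. The function names and the four example values are hypothetical and are not data from the paper.

```python
import math

def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def rmse(observed, predicted):
    """Root mean square error, in the same unit as the inputs (persons per township)."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical township totals: census counts vs. grid cells summed per township.
census = [1000.0, 2000.0, 3000.0, 4000.0]
gridded = [1100.0, 1900.0, 3100.0, 3900.0]

print(r_squared(census, gridded), rmse(census, gridded))
```

R² is unitless, while RMSE stays in persons per township, so the RMSE values for the four age groups are only comparable in light of each group's typical population size.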

  • New
  • Open Access
  • Research Article
  • 10.3390/data11020025
TGEconomicDataset: A Collection of Russian-Language Economic Telegram Channels and a Synthetic Data Generation Framework for Continuous Authentication
  • Jan 28, 2026
  • Data
  • Elena Luneva + 2 more

Telegram, along with WhatsApp and Signal, has become very popular due to its hybrid capabilities, including both instant private and public messaging, making it an effective tool for quickly broadcasting content to a wide audience. This article presents TGEconomicDataset, a new dataset containing more than 2.9 million messages from the most popular Russian-language Telegram channels in the field of economics, as well as synthetically generated labeled mixtures of these channels. These mixtures are specifically designed to model authorship change scenarios for testing various methods for solving the problem of continuous authentication, which is of particular interest due to the need for organizations and companies to rely on data posted on social media. The presented dataset is enriched with quotes of important financial instruments such as gold futures, the USD/RUB currency pair, BRENT oil, the dollar index (DXY), and bitcoin (BTC), synchronized with the message timestamps. A detailed joint analysis of the collected data is provided. In addition to the presented dataset, we publish the scripts used to collect the data, integrate the financial indicators, and generate the synthetic mixtures for the continuous authentication task, ensuring full reproducibility of the research.

  • New
  • Open Access
  • Research Article
  • 10.3390/data11020024
Face Typicality–Distinctiveness Norms for the 304 Front-View Faces of the Glasgow Unfamiliar Face Database
  • Jan 26, 2026
  • Data
  • Paulo Ventura + 2 more

Face typicality and distinctiveness are key facial attributes that influence face recognition performance and the formation of social impressions. The present study aimed to provide normative data for these dimensions, offering a useful resource for face recognition research. Using a 7-point Likert scale, adult participants rated 304 front-facing faces from the Glasgow Unfamiliar Face Database (GUFD) for typicality–distinctiveness. Results indicated that the subjective rating method produced reliable estimates, with meaningful variability observed along the typicality–distinctiveness continuum. Highly distinctive faces were more sparsely represented in the database. These norms can support principled stimulus selection and improved methodological control in empirical research with faces.

  • New
  • Open Access
  • Research Article
  • 10.3390/data11010023
A Reproducible FPGA–ADC Synchronization Architecture for High-Speed Data Acquisition
  • Jan 21, 2026
  • Data
  • Van Muoi Ngo + 1 more

High-speed data acquisition systems based on field-programmable gate arrays (FPGAs) often face synchronization challenges when interfacing with commercial analog-to-digital converters (ADCs), particularly under constrained hardware routing conditions and vendor-specific clocking assumptions. This work presents a vendor-independent FPGA–ADC synchronization architecture that enables reliable and repeatable high-speed data acquisition without relying on clock-capable input resources. Clock and frame signals are internally reconstructed and phase-aligned within the FPGA using mixed-mode clock management (MMCM) and input serializer/deserializer (ISERDES) resources, enabling time-sequential phase observation without the need for parallel snapshot or delay-line structures. Rather than targeting absolute metrological limits, the proposed approach emphasizes a reproducible and transparent data acquisition methodology applicable across heterogeneous FPGA–ADC platforms, in which clock synchronization is treated as a system-level design parameter affecting digital interface timing integrity and data reproducibility. Experimental validation using a custom Kintex-7 (XC7K325T) FPGA and an AFE7225 ADC demonstrates stable synchronization at sampling rates of up to 125 MS/s, with frequency-offset tolerance determined by the phase-tracking capability of the internal MMCM-based alignment loop. Consistent signal acquisition is achieved over the 100 kHz–20 MHz frequency range. The measured interface level timing uncertainty remains below 10 ps RMS, confirming robust clock and frame alignment. Meanwhile, the observed signal-to-noise ratio (SNR) performance, exceeding 80 dB, reflects the phase–noise-limited measurement quality of the system. The proposed architecture provides a cost-effective, scalable, and reproducible solution for experimental and research-oriented FPGA-based data acquisition systems operating under practical hardware constraints.

  • Open Access
  • Research Article
  • 10.3390/data11010021
Dataset for Device-Free Wireless Sensing of Crowd Size in Public Transportation Environments
  • Jan 14, 2026
  • Data
  • Robin Janssens + 2 more

Congested platforms in public transportation systems can jeopardize the safety and comfort of passengers. Real-time crowd size estimation using Device-Free Wireless Sensing (DFWS) can offer a privacy-preserving solution for monitoring and preventing overcrowding. However, no public dataset exists on DFWS in public transportation environments. In this work, we introduce a new dataset comprising two different public transportation environments, which contains data on the presence of rail vehicles at the platform, as well as manual people counts at regular intervals. By providing this dataset, we aim to offer a foundation for other DFWS researchers to explore novel algorithms and methods in public transportation environments.