Abstract

Sharing health data could avoid duplication of effort in data collection, reduce unnecessary costs in future studies, and encourage collaboration and data flow within the scientific community. Several repositories from national institutions or research teams have made their datasets available. These data are mainly aggregated at a spatial or temporal level, or dedicated to a specific field. The objective of this work is to propose a standardized storage format and description for open datasets intended for research purposes. To this end, we selected 8 publicly accessible datasets covering the fields of demographics, employment, education and psychiatry. We then studied the format, nomenclature (i.e., file and variable names, modalities of recurrent qualitative variables) and descriptions of these datasets, and proposed a common, standardized format and description. We made these datasets available in an open GitLab repository. For each dataset, we provided the raw data file in its original format, the cleaned data file in CSV format, the variable descriptions, the data management script and the descriptive statistics. Statistics are generated according to the previously documented type of each variable. After one year of use, we will evaluate with the users whether the standardization of the datasets is relevant and how they use the datasets in practice.
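As an illustration of generating statistics according to the documented variable type, the following is a minimal sketch and not the authors' data management script. The variable names, the type labels "quantitative" and "qualitative", and the `describe` helper are assumptions made for this example; rows are assumed to arrive as dictionaries of strings, as `csv.DictReader` would yield from a cleaned CSV file.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical variable description: variable name -> documented type.
VARIABLE_TYPES = {"age": "quantitative", "sex": "qualitative"}


def describe(rows, variable_types):
    """Return per-variable statistics chosen by the documented type."""
    summary = {}
    for name, kind in variable_types.items():
        # Treat empty strings and None as missing values.
        values = [row[name] for row in rows if row.get(name) not in ("", None)]
        if kind == "quantitative":
            numbers = [float(v) for v in values]
            summary[name] = {
                "n": len(numbers),
                "mean": mean(numbers) if numbers else None,
                "median": median(numbers) if numbers else None,
                "min": min(numbers) if numbers else None,
                "max": max(numbers) if numbers else None,
            }
        elif kind == "qualitative":
            # Frequency of each modality among non-missing values.
            summary[name] = {"n": len(values), "counts": dict(Counter(values))}
        else:
            raise ValueError(f"unknown variable type: {kind!r}")
    return summary


# Made-up rows for illustration only; not taken from any of the datasets.
rows = [
    {"age": "34", "sex": "F"},
    {"age": "51", "sex": "M"},
    {"age": "", "sex": "F"},
]
stats = describe(rows, VARIABLE_TYPES)
```

Driving the summary from the variable description rather than from inferred column types keeps the statistics consistent with the documentation, for instance when a qualitative variable is coded with numbers.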
