Data dictionaries: essential tools for the ethical and transparent use of integrated data
Data transparency lays the groundwork for the ethical use of administrative data. This is particularly true for linked administrative data within integrated data systems (IDS). Data dictionaries, resources that maintain the metadata of the information housed in an IDS, offer a tool to ensure transparency throughout the data life cycle. The FAIR Principles, which assert that data be Findable, Accessible, Interoperable, and Reusable, provide a useful framework by which to measure the effectiveness of data dictionaries in the IDS context. This paper uses the FAIR Principles to discuss the ways in which data dictionaries serve as tools in the ethical and transparent use of integrated data as well as the challenges that remain. Linked administrative data is a valuable source of information for programmatic and academic research. Data dictionaries facilitate the ethical handling of this sensitive information and maintain a commitment to transparency in data inquiry and research.
37
- 10.1016/j.jbi.2011.10.006
- Oct 29, 2011
- Journal of Biomedical Informatics
5
- 10.1111/jlme.12140
- Jan 1, 2014
- The Journal of law, medicine & ethics : a journal of the American Society of Law, Medicine & Ethics
1
- 10.23889/ijpds.v8i4.2159
- Oct 4, 2023
- International Journal of Population Data Science
14
- 10.23889/ijpds.v5i3.1367
- Sep 30, 2020
- International Journal of Population Data Science
119
- 10.1016/j.jbi.2020.103421
- May 12, 2020
- Journal of Biomedical Informatics
262
- 10.1177/2053951717745678
- Dec 1, 2017
- Big Data & Society
11886
- 10.1038/sdata.2016.18
- Mar 15, 2016
- Scientific Data
5
- 10.1007/0-306-46991-x_7
- Jan 1, 2002
1089
- 10.7551/mitpress/11805.001.0001
- Mar 10, 2020
303
- 10.1146/annurev-publhealth-031210-100700
- Apr 18, 2011
- Annual Review of Public Health
- Research Article
5
- 10.2139/ssrn.2512590
- Oct 22, 2014
- SSRN Electronic Journal
This report responds to a Workforce Data Quality Initiative (WDQI) challenge: the unreported quality of person identification (PI) features in many integrated data systems (IDS) that link confidential workforce, education and social services administrative records. The importance of the PI topic reflects concern that many local K-12 education agencies do not collect student Social Security Numbers. Some conclude from this widespread omission that linkage of secondary student records with workforce data may be impossible. However, others have adopted ad hoc and commercial software solutions to bridge this gap. To date no standard record linkage method has been endorsed. Will performance dashboards and research findings based on IDS information be accepted as trustworthy by individuals making important appropriation of funds, policy and program-level resource allocation decisions? Should IDS public-use releases be believed and acted upon? A standard technical language is used in professional communication about PI topics. Record linkage can be pursued using exact matching or statistical matching. Within the exact matching portfolio are deterministic and probabilistic methods. And within the deterministic portfolio are direct and hierarchical methods. A familiar first step among WDQI award teams is application of exact matching when two or more administrative data files each contains a SSN field. This first step is also the last step in some record linkage actions, which introduces selection bias threats, singly or in various combinations.
Confirmation that a SSN has been issued, and is therefore valid, does not mean that the valid nine-digit SSN was issued to the person associated with this SSN in one or more administrative data files. We completed a series of three record linkage steps: (1) determine what candidate identifiers are available in each administrative data set; (2) use Link Plus software to carry out multiple deterministic and probabilistic PI diagnostics; and (3) examine the potential matched pairs identified in step two, assigning each pair to one of three categories: match, non-match, or uncertain match. Our intent has been to illustrate typical PI accuracy challenges that are found in administrative data files. These challenges occur over time within a single administrative data source and among different administrative data files. Our diagnostic findings are not amenable to summary coverage. Sections 5 and 6 describe what steps we undertook and what we found. Given our diagnostic findings to date: So what? If left unresolved, can a PI of unreported and perhaps unknown quality translate into unacceptable deficiencies in information, conclusions and recommendations that are released to stakeholders making important decisions about appropriation of funds, policies and program-level priorities? PI accuracy is a necessary first step for successful integration of multiple administrative data sources. This is a universal requirement that applies to any and all attempts to link unit-record person specific administrative data sources. Avoidance of stakeholder skepticism (rejection at worst) is within our collective control, but we need to take positive steps now to retain this control. Lost confidence is difficult to recover.
We need to be out in front of this potential threat to realization of the return on past, current and future IDS investments. We are not aware of an ongoing serious and sustained professional conversation about the criteria that are appropriate to define PI accuracy tolerances for specific applications. This conversation is needed because the community of practitioners does not know whether we are over- or under-investing in PI technologies and applications. We encourage the U.S. Department of Labor, Employment and Training Administration WDQI leadership team to propose an appropriate forum, perhaps through the technical assistance resources of Social Policy Research Associates, to ensure immediate attention to the PI accuracy topic.
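The distinction drawn in this abstract between deterministic and probabilistic exact matching, and the three-way sorting of candidate pairs into match, non-match, or uncertain match, can be illustrated with a minimal Python sketch. It is not the Link Plus algorithm and not the authors' procedure. The field names, the m/u agreement probabilities, and the two thresholds are all hypothetical values chosen for illustration, and the scoring follows the general log-weight idea used in probabilistic linkage.

```python
import math

# Hypothetical agreement probabilities per field (illustrative assumptions):
#   m = P(field agrees | both records describe the same person)
#   u = P(field agrees | records describe different people)
FIELD_PROBS = {
    "ssn": (0.95, 0.0001),
    "last_name": (0.90, 0.01),
    "birth_date": (0.95, 0.003),
}


def deterministic_match(a, b):
    """Deterministic diagnostic: exact SSN agreement when both records carry one."""
    return bool(a.get("ssn")) and a.get("ssn") == b.get("ssn")


def probabilistic_score(a, b):
    """Probabilistic diagnostic: sum of log2 weights over fields present in both records."""
    score = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if not a.get(field) or not b.get(field):
            continue  # a missing value contributes no evidence either way
        if a[field] == b[field]:
            score += math.log2(m / u)              # agreement weight (positive)
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return score


def classify(score, lower=0.0, upper=10.0):
    """Sort a candidate pair into one of three categories using two hypothetical cut-offs."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "uncertain"


# Made-up records; the education record has no SSN, as in many K-12 files.
workforce = {"ssn": "000000001", "last_name": "SMITH", "birth_date": "1990-01-01"}
education = {"ssn": None, "last_name": "SMITH", "birth_date": "1990-01-10"}

print(deterministic_match(workforce, education))
print(classify(probabilistic_score(workforce, education)))
```

In this sketch a pair without a shared SSN can never pass the deterministic diagnostic, which is the selection-bias concern raised above. The probabilistic score still uses the remaining fields, and pairs that land between the two cut-offs would be left for manual review.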
- Research Article
- 10.4314/jorind.v13i1
- Aug 6, 2015
- Journal of Research in National Development
The assessment of an Integrated Information System (IIS) in an organisation is an important initiative that enables Information System (IS) managers, as well as top management, to understand how successful their investment in IS integration has been. Without a proper assessment, an organisation will not know the status of its IIS, which may affect its judgment about what action to take next. Current research on IIS assessment is limited, and the existing literature focuses mainly on the technical aspect of IIS. It is argued that assessing the technical aspect alone is inadequate, since the organisational and strategic aspects of IIS should also be considered. The methods, techniques and tools currently used by vendors for IIS assessment also lack comprehensive measures to fully assess an Integrated Information System in terms of the technical, organisational and strategic domains. The purpose of this study is to establish critical success factors for measuring the success of an Integrated Information System. These factors are used as the basis for constructing an approach to comprehensively assess IIS in an organisation. A comprehensive list of success factors for IIS assessment, established from the literature, was initially presented. Expert surveys using both manual and online methods were conducted to verify the factors. Based on the factors, an instrument for IIS assessment was constructed. The results from a case study indicate that a comprehensive assessment approach not only establishes the level of success but also reveals the contributing factors. This research contributes to the field of Information Systems, specifically in the area of Integrated Information System assessment. Keywords: Integrated Information System, assessment, technical aspect, organisation, management
- Research Article
5
- 10.1055/s-0040-1712510
- Feb 1, 2020
- Methods of Information in Medicine
There is a recognized need to improve how scholarly data are managed and accessed. The scientific community has proposed the findable, accessible, interoperable, and reusable (FAIR) data principles to address this issue. The objective of this case study was to develop a system for improving the FAIRness of Healthcare Cost and Utilization Project's State Emergency Department Databases (HCUP's SEDD) within the context of data catalog availability. A search tool, EDCat (Emergency Department Catalog), was designed to improve the "FAIRness" of electronic health databases and tested on datasets from HCUP-SEDD. ElasticSearch was used as a database for EDCat's search engine. Datasets were curated and defined. Searchable data dictionary-related elements and unified medical language system (UMLS) concepts were included in the curated metadata. Functionality to standardize search terms using UMLS concepts was added to the user interface. The EDCat system improved the overall FAIRness of HCUP-SEDD by improving the findability of individual datasets and increasing the efficacy of searches for specific data elements and data types. The databases considered for this case study were limited in number as few data distributors make the data dictionaries of datasets available. The publication of data dictionaries should be encouraged through the FAIR principles, and further efforts should be made to improve the specificity and measurability of the FAIR principles. In this case study, the distribution of datasets from HCUP-SEDD was made more FAIR through the development of a search tool, EDCat. EDCat will be evaluated and developed further to include datasets from other sources.
- Research Article
- 10.23889/ijpds.v5i5.1638
- Dec 7, 2020
- International Journal of Population Data Science
Background: Public agencies hold important, yet largely unused, administrative data on the families and communities they serve. Integrated Data Systems (IDS) provide the governance process, legal framework, technology, and human capacity to connect these families and communities across data siloes. By securely linking administrative data across siloes, IDS are able to support data-informed decision making.
Introduction: For 10+ years, AISP has helped jurisdictions through the developmental process of building IDS. We operate a network of 22 U.S. states and counties with fully-functioning Integrated Data Systems, and provide technical assistance to 18 jurisdictions at various stages of IDS development.
Objectives and Approach: This session presents the outcomes of an independent evaluation of our Learning Community initiative (2019) and presents a new developmental framework that outlines key dimensions of quality and readiness for IDS building and implementation.
Results: As of 2020, 20 sites have received formal 18-month cohort based technical assistance. This presentation will discuss site-based approaches to facilitate data sharing, including common challenges and solutions, and progress to date, including findings of an independent evaluation (2019). We will also present a framework developed based on the deep knowledge developed through technical assistance efforts, and findings from a national survey of data integration efforts conducted in 2020.
The framework uses purpose, partnership structure, technical architecture, and organizational model (with respect to the strengths and challenges of each) to categorize and synthesize data integration efforts for social policy and program improvement. The developmental approach to our work emphasizes that we seek to understand methods for sustainability in diverse ways.
Conclusion / Implications: While there is broad agreement in the value of integrating data across domains, developing the capacity and skills necessary to link administrative data for policy evaluation and research remains an elusive goal. Initial results indicate that an individualized yet collaborative technical assistance approach is successful in developing data integration capacity.
- Research Article
3
- 10.1080/07399018608965272
- Jan 1, 1986
- Journal of Information Systems Management
In the past several years, profound changes have occurred in business data processing. Integrated information systems, separation of data from application programs, large secondary memory devices, data base managers, distributed computing, and data dictionaries have contributed to the concept that data is a corporate resource. The data administration function is now widely considered to be the key to more effective long-term data resource management.
- Research Article
- 10.23889/ijpds.v9i5.2547
- Sep 10, 2024
- International Journal of Population Data Science
Objective and Approach: Cross-sector data sharing and linkage can transform information about individuals into actionable intelligence to build stronger, healthier, and more just communities. Yet, the use of cross-sector data can also reinforce legacies of racist policies and produce inequitable resource allocation, access, and outcomes. To avoid this, we must embed data equity practices throughout the data life cycle and provide mechanisms for community voice and input. However, few integrated data systems are doing this well. To combat this challenge, AISP designed a 30-month technical assistance program, the Equity in Practice Learning Community (EiPLC), to collaboratively develop guidance and models for centering racial equity in data integration. This session provides an overview of our EiPLC Scope & Sequence, a curriculum we designed to guide this work. Currently, 10 jurisdictions in the U.S. are receiving coaching around this curriculum and shifting their data integration practices to advance equity. Results: To date, progress has been nonlinear, with positive results. Work in action across the cohorts includes changes to cross-agency data governance, collaborative research agendas, legal agreements, staffing, and community participation practices. Conclusions: While these efforts are nascent, progress is evident and site team members indicate growth as individuals and as site teams. Key features of success include a focus on building relationships, establishing working norms, grappling with racialized histories, and interrogating the role of structural racism and systems of power. Implications: The EiPLC provides a successful model for supporting sites to embed data equity principles throughout the data life cycle.
- Book Chapter
2
- 10.1007/978-0-387-35501-6_13
- Jan 1, 2000
This brief update describes research on the maintenance of integrity in information systems via the establishment of an organisational environment that will prevent the damage caused by external agents. The main focus is on the development of information security metrics and a computer simulation model for the threat of computer viruses in organisations. Early results from this research project were presented at the IFIP TC11 Working Group 11.5’s Second Working Conference on Integrity and Internal Control in Information Systems (Mon and Gove, 1998). This brief update summarizes subsequent work conducted by a team comprising Science Applications International Corporation, Science Communication Studies and the Towson University Applied Mathematics Laboratory (SAIC 1999a, 1999b).
- Research Article
- 10.2139/ssrn.2512589
- Jan 20, 2013
- SSRN Electronic Journal
Neglecting the 'L' in a Longitudinal Integrated Data System Can Be a Costly Mistake
- Book Chapter
1
- 10.1016/b978-0-7506-1038-4.50013-0
- Jan 1, 1990
- Designing Information Systems
Chapter 10 - The data dictionary
- Research Article
2
- 10.1186/s13023-024-03193-y
- May 6, 2024
- Orphanet journal of rare diseases
Background: Rare disease registries (RDRs) are valuable tools for improving clinical care and advancing research. However, they often vary qualitatively, structurally, and operationally in ways that can determine their potential utility as a source of evidence to support decision-making regarding the approval and funding of new treatments for rare diseases. Objectives: The goal of this research project was to review the literature on rare disease registries and identify best practices to improve the quality of RDRs. Methods: In this scoping review, we searched MEDLINE and EMBASE as well as the websites of regulatory bodies and health technology assessment agencies from 2010 to April 2023 for literature offering guidance or recommendations to ensure, improve, or maintain quality RDRs. Results: The search yielded 1,175 unique references, of which 64 met the inclusion criteria. The characteristics of RDRs deemed to be relevant to their quality align with three main domains and several sub-domains considered to be best practices for quality RDRs: (1) governance (registry purpose and description; governance structure; stakeholder engagement; sustainability; ethics/legal/privacy; data governance; documentation; and training and support); (2) data (standardized disease classification; common data elements; data dictionary; data collection; data quality and assurance; and data analysis and reporting); and (3) information technology (IT) infrastructure (physical and virtual infrastructure; and software infrastructure guided by FAIR principles (Findability; Accessibility; Interoperability; and Reusability)). Conclusions: Although RDRs face numerous challenges due to their small and dispersed populations, RDRs can generate quality data to support healthcare decision-making through the use of standards and principles on strong governance, quality data practices, and IT infrastructure.
- Preprint Article
- 10.5194/egusphere-egu25-9663
- Mar 18, 2025
CSV and Excel formats are among the most common storage formats for data sharing, especially in scientific and government contexts. Chaves-Fraga et al. note that a significant amount of public data is published in tabular formats such as CSV and Excel, which can hinder data accessibility and interoperability due to their lack of standardized metadata (Chaves-Fraga et al., 2021). This is in line with the findings of Van Den Burg et al. (2019). They highlight that although CSV files are widely used due to their simplicity, they often lack the necessary metadata to ensure data quality and provenance, which are crucial for compliance with the FAIR principles. Furthermore, Kaur et al. (2021) highlight that many health information systems allow data to be exported in CSV format, which is accessible but does not provide the semantic interoperability needed for effective data sharing and reuse. Furthermore, the limitations of CSV and Excel formats are compounded when datasets are converted to SQLite databases.
The NFS group (NuoroForestrySchool.io) has developed an open source Python-based application (https://gitlab.com/NuoroForestrySchool/nfs-data-documentation-procedure) that facilitates the organization of the data a researcher is willing to share. The application is designed to be used as a command line tool or through a graphical interface. It reads as input a spreadsheet file with one sheet for each table, plus an application-specific sheet defining the database schema, the data dictionary, the DataCite metadata, and other specific metadata (extended title, abstract/summary). The output of the procedure is a SQLite file containing all the data and metadata, as well as an image of the graphical ERD-like schema, and a formal PDF document presenting the contents of the database. The SQLite file is a metadata-rich SQL-based database, taking full advantage of relational features and thus improving data accessibility, interoperability, and reusability by humans and machines.
The use of the procedure is demonstrated by processing a simple but significant use case.
LITERATURE
Chaves-Fraga, David, Edna Ruckhaus, Freddy Priyatna, Maria-Esther Vidal, and Oscar Corcho. 2021. "Enhancing virtual ontology based access over tabular data with Morph-CSV". Edited by Axel-Cyrille Ngonga Ngomo, Muhammad Saleem, Ruben Verborgh, Muhammad Intizar Ali, and Olaf Hartig. Semantic Web 12 (6): 869–902. https://doi.org/10.3233/SW-210432.
Kaur, Jasleen, Jasmine Kaur, Shruti Kapoor, and Harpreet Singh. 2021. "Design & Development of Customizable Web API for Interoperability of Antimicrobial Resistance Data". Scientific Reports 11 (1): 11226. https://doi.org/10.1038/s41598-021-90601-z.
Van Den Burg, G. J. J., A. Nazábal, and C. Sutton. 2019. "Wrangling Messy CSV Files by Detecting Row and Type Patterns". Data Mining and Knowledge Discovery 33 (6): 1799–1820. https://doi.org/10.1007/s10618-019-00646-y.
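The idea of shipping a data dictionary inside the same SQLite file as the data it describes can be sketched with Python's standard `sqlite3` module. This is only an illustration under stated assumptions, not the NFS application or its schema. The `plots` table, the `data_dictionary` layout, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample data and dictionary entries (not from the NFS tool).
ROWS = [(1, 32.5, "Quercus ilex"), (2, 18.0, "Pinus pinea")]
DICTIONARY = [
    ("plots", "plot_id", "Unique identifier of the sample plot", None),
    ("plots", "dbh_cm", "Tree diameter at breast height", "cm"),
]


def build_db(rows, dictionary):
    """Create an in-memory database holding both the data and its data dictionary."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE plots (plot_id INTEGER PRIMARY KEY, dbh_cm REAL, species TEXT)"
    )
    conn.execute(
        "CREATE TABLE data_dictionary ("
        " table_name TEXT, column_name TEXT, description TEXT, unit TEXT,"
        " PRIMARY KEY (table_name, column_name))"
    )
    conn.executemany("INSERT INTO plots VALUES (?, ?, ?)", rows)
    conn.executemany("INSERT INTO data_dictionary VALUES (?, ?, ?, ?)", dictionary)
    conn.commit()
    return conn


def undocumented_columns(conn, table):
    """List columns of `table` that have no entry in the data dictionary."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    # The table name is interpolated directly, so pass trusted names only.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    documented = {
        row[0]
        for row in conn.execute(
            "SELECT column_name FROM data_dictionary WHERE table_name = ?", (table,)
        )
    }
    return [c for c in columns if c not in documented]


conn = build_db(ROWS, DICTIONARY)
print(undocumented_columns(conn, "plots"))
```

Because the dictionary travels in the same file, a recipient can query column descriptions and units with plain SQL, and a check like `undocumented_columns` can flag gaps before the file is shared.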
- Conference Article
1
- 10.5281/zenodo.2625721
- Apr 10, 2019
The SHARC (SHAring Reward & Credit) interest group (IG) is an interdisciplinary group set up in the framework of RDA (Research Data Alliance) to improve crediting and rewarding mechanisms in the sharing process throughout the data life cycle. Notably, one of the objectives is to promote data sharing activities in research assessment schemes at national and European levels. To this aim, the RDA-SHARC IG is developing assessment grids using criteria to establish whether data are compliant with the FAIR principles (findable / accessible / interoperable / reusable). The grid aims to be extensive, generic and trans-disciplinary. It is meant to be used by evaluators to assess the quality of the sharing practice of the researcher/scientist over a given period, taking into account the means & support available over that period. The grid displays a mind-mapped tree-graph structure based on previous works on FAIR data management (Reymonet et al., 2018; Wilkinson et al., 2016; Wilkinson et al., 2018; and E.U. Guidelines about FAIRness Data Management Plans). The criteria used are based on the work from FORCE11 and the Open Science Career Assessment Matrix designed by the EC Working group on Rewards under Open science. The criteria are organised in 5 clusters: ‘Motivations for sharing’; ‘Findable’, ‘Accessible’, ‘Interoperable’ and ‘Reusable’. For each criterion, 4 graduations are proposed (‘Never / Not Assessable’; ‘If mandatory’; ‘Sometimes’; ‘Always’). Only one value must be selected per criterion. Evaluation should be done by cluster; the final overall assessment will be based on the sum of the number of each ticked value / total number of criteria in each cluster; the ‘motivations for sharing’ should be appreciated qualitatively in the final interpretation. The final goals are to develop a graduated assessment of the researcher's FAIRness literacy and to help identify needs in order to build FAIRness guidelines that improve the sharing capacity of researchers.
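The per-cluster arithmetic described above (number of ticks for each graduation divided by the total number of criteria in the cluster) can be sketched in Python as follows. This is one reading of that sentence, not the SHARC group's tool. The criterion identifiers such as `F1` are hypothetical placeholders.

```python
from collections import Counter

# The four graduations named in the abstract.
GRADUATIONS = ("Never / Not Assessable", "If mandatory", "Sometimes", "Always")


def cluster_profile(answers):
    """Return, for one cluster, the share of criteria ticked at each graduation.

    `answers` maps a criterion identifier to exactly one graduation,
    reflecting the rule that only one value is selected per criterion.
    """
    if not answers:
        raise ValueError("a cluster needs at least one assessed criterion")
    for criterion, value in answers.items():
        if value not in GRADUATIONS:
            raise ValueError(f"unknown graduation for {criterion}: {value!r}")
    counts = Counter(answers.values())
    total = len(answers)
    return {g: counts[g] / total for g in GRADUATIONS}


# Hypothetical answers for the 'Findable' cluster.
findable = {"F1": "Always", "F2": "Always", "F3": "Sometimes", "F4": "If mandatory"}
print(cluster_profile(findable))
```

Running the function once per cluster gives a profile for 'Findable', 'Accessible', 'Interoperable' and 'Reusable'. The 'Motivations for sharing' cluster is left out of the arithmetic here because the abstract says it should be appreciated qualitatively.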
- Research Article
1
- 10.12688/hrbopenres.13215.1
- Feb 9, 2021
- HRB open research
Background: This study aims to examine the potential of currently available administrative health data for palliative and end-of-life care (PEoLC) research in Ireland. The objectives are to i) identify administrative health data sources for PEoLC research, ii) describe the challenges and opportunities of using these, and iii) estimate the impact of recent health system reforms and changes to data protection laws. Methods: The 2017 Health Information and Quality Authority catalogue of health and social care datasets was cross-referenced with a recognised list of diseases with associated palliative care needs. Criteria to assess the datasets included population coverage, data collected, data dictionary and data model availability, and mechanisms for data access. Results: Eight datasets with potential for PEoLC research were identified, including four disease registries (cancer, cystic fibrosis, motor neurone and interstitial lung disease), death certificate data, hospital episode data, community prescription data and one national survey. The ad hoc development of the health system in Ireland has resulted in i) a fragmented information infrastructure resulting in gaps in data collections, particularly in the primary and community care sector where much palliative care is delivered, ii) ill-defined data governance arrangements across service providers, many of whom are not part of the publicly funded health service, and iii) systemic and temporal issues that affect data quality. Initiatives to improve data collections include introduction of i) patient unique identifiers, ii) health entity identifiers and iii) integration of the eircode postcodes. Recently enacted general data protection and health research regulations will clarify legal and ethical requirements for data use. Conclusions: With appropriate permissions, detailed knowledge of the datasets and good study design, currently available administrative health data can be used for PEoLC research.
Ongoing reform initiatives and recent changes to data privacy laws will facilitate future use of administrative health data for PEoLC research.
- Research Article
6
- 10.1016/j.evalprogplan.2022.102093
- Apr 22, 2022
- Evaluation and program planning
Leveraging integrated data for program evaluation: Recommendations from the field
- Conference Article
1
- 10.1109/hicss.1992.183503
- Jan 1, 1992
Describes a method to integrate information systems (IS) design performance evaluation with the IS development process. The nature and cost of an information system is shaped by decisions about where to provide computer support, the hardware platform, and the database management system (DBMS) architecture-decisions influenced largely by how quickly work must be done. A prototype system has been developed which produces simulation results automatically from data flow diagrams (DFDs) augmented with information regarding the performance of system components. The objective is to make the evaluation of IS design dynamics a common and integral part of the development process by producing simulation results directly from computer-aided software engineering (CASE) tool data dictionaries. The prototype reads DFDs from a custom DFD-drawing tool and formulates a corresponding simulation model. The prototype provides model-based expert advice in the use of the simulation model and in the interpretation of its output. The use of the prototype is illustrated through its application to a proposed research clinic information system.
- Research Article
- 10.23889/ijpds.v10i2.2966
- Nov 5, 2025
- International Journal of Population Data Science
- Research Article
- 10.23889/ijpds.v10i1.2976
- Nov 4, 2025
- International Journal of Population Data Science
- Research Article
- 10.23889/ijpds.v10i1.2463
- Nov 3, 2025
- International Journal of Population Data Science
- Research Article
- 10.23889/ijpds.v10i2.2972
- Oct 28, 2025
- International Journal of Population Data Science
- Research Article
- 10.23889/ijpds.v10i1.2923
- Oct 20, 2025
- International Journal of Population Data Science
- Research Article
- 10.23889/ijpds.v10i2.2956
- Oct 13, 2025
- International Journal of Population Data Science
- Research Article
- 10.23889/ijpds.v10i5.3332
- Oct 6, 2025
- International Journal of Population Data Science
- Research Article
- 10.23889/ijpds.v10i5.3329
- Oct 6, 2025
- International Journal of Population Data Science
- Research Article
- 10.23889/ijpds.v10i5.3327
- Oct 6, 2025
- International Journal of Population Data Science
- Research Article
- 10.23889/ijpds.v10i5.3348
- Oct 6, 2025
- International Journal of Population Data Science