Ghosts in the Data: The Contested Politics of Absence in Data Infrastructures

Abstract

Absences are inescapable in data. Data collection always focuses on some elements while occluding others. Yet, how absences are considered and recorded within data infrastructures markedly transforms the inferences that can be made. Tracing a genealogy from early databases to contemporary AI datasets, this paper explores how data infrastructures have grappled with the inherent incompleteness of data. Specifically, I uncover a tension between a desire for certainty and acknowledging partiality at the foundation of data science that continues to pervade contemporary AI datasets. Drawing on archival studies and sociological perspectives, I argue that data science must embrace uncertainty by recognizing the “ghosts in the data”—the uncounted, the unrepresented, and the silenced—and how their absence shapes the outcomes of automated systems.
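
The claim that recording conventions transform inferences can be made concrete. The sketch below is my own minimal illustration, not from the paper: the same six hypothetical survey responses yield different "average incomes" depending on whether absences are silently dropped, recoded as a zero sentinel, or kept explicit and reported.

```python
import numpy as np

# Hypothetical survey incomes; np.nan marks respondents who declined to answer.
incomes = np.array([32_000, 45_000, np.nan, 51_000, np.nan, 38_000])

# Convention 1: silently drop the absent values (the ghosts vanish).
mean_dropped = np.nanmean(incomes)

# Convention 2: recode absence as zero, a sentinel that masquerades as data.
mean_zeroed = np.nan_to_num(incomes, nan=0.0).mean()

# Convention 3: keep absence explicit and report it alongside the estimate.
n_missing = int(np.isnan(incomes).sum())

print(f"dropping absences: mean = {mean_dropped:,.0f}")
print(f"zeroing absences:  mean = {mean_zeroed:,.0f}")
print(f"explicit absences: mean = {mean_dropped:,.0f} "
      f"({n_missing} of {incomes.size} unrecorded)")
```

Only the third convention keeps the ghosts visible: the estimate travels with a count of who was never recorded.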

Similar Papers
  • Book Chapter
  • 10.1007/978-3-030-17152-0_3
Do We Need a Critical Evaluation of the Role of Mathematics in Data Science?
  • Jan 1, 2019
  • Patrick Allo

A sound and effective data ethics requires an independent and mature epistemology of data science. We cannot address the ethical risks associated with data science if we cannot effectively diagnose its epistemological failures, and this is not possible if the outcomes, methods, and foundations of data science are themselves immune to criticism. An epistemology of data science that guards against the unreflective reliance on data science blocks this immunity. Critical evaluations of the epistemic significance of data and of the impact of design decisions in software engineering already contribute to this enterprise but leave the role of mathematics within data science largely unexamined. In this chapter we take a first step to fill this gap. In the first part, we emphasise how data, code, and maths jointly enable data science, and how they contribute to the epistemic and scientific respectability of data science. This analysis reveals that if we leave out the role of mathematics, we cannot adequately explain how epistemic success in data science is possible. In the second part, we consider the more contentious dual issue: do explanations of epistemic failures in data science also force us to critically assess the role of maths in data science? Here, we argue that mathematics not only contributes mathematical truths to data science, but also substantive epistemic values. If we evaluate these values against a sufficiently broad understanding of what counts as epistemic success and failure, our question should receive a positive answer.

  • Single Book
  • Citations: 6
  • 10.1201/9781003206743
Physics of Data Science and Machine Learning
  • Nov 1, 2021
  • Ijaz A Rauf

Physics of Data Science and Machine Learning links fundamental concepts of physics to data science, machine learning, and artificial intelligence for physicists looking to integrate these techniques into their work. Written explicitly for physicists, the book marries quantum and statistical mechanics with modern data mining, data science, and machine learning. It also explains how to integrate these techniques into the design of experiments, while exploring neural networks and machine learning, building on fundamental concepts of statistical and quantum mechanics. The book is a self-learning tool for physicists looking to learn how to utilize data science and machine learning in their research. It will also be of interest to computer scientists and applied mathematicians, alongside graduate students looking to understand the basic concepts and foundations of data science, machine learning, and artificial intelligence. Although specifically written for physicists, it will also give non-physicists an opportunity to understand the fundamental concepts from a physics perspective and so aid the development of new and innovative machine learning and artificial intelligence tools. Key features:
  • Introduces the design of experiments and digital twin concepts in simple lay terms for physicists to understand, adopt, and adapt.
  • Free from endless derivations; instead, equations are presented with a strategic explanation of why they matter and how they help with the task at hand.
  • Illustrations and simple explanations help readers visualize and absorb difficult concepts.
Ijaz A. Rauf is an adjunct professor at the School of Graduate Studies, York University, Toronto, Canada. He is also an associate researcher at Ryerson University, Toronto, Canada and president of the Eminent-Tech Corporation, Bradford, ON, Canada.

  • Conference Article
  • Citations: 4
  • 10.1145/3328778.3367015
Innovation in Undergraduate Data Science Education
  • Feb 26, 2020
  • Eric Van Dusen + 2 more

The workshop will allow participants to gain experience with a series of innovations developed at UC Berkeley that have enabled the teaching of undergraduate data science at scale to students from all backgrounds. Rather than beginning with established introductory strategies as the gateway to computer science, students in the Foundations of Data Science (data8.org) learn computational skills and concepts in relation to real-world issues and with attention to societal implications. By engaging with students' interest in the applications of computing on data, and integrating societal impact from the start, the program has developed long-term commitment to advancing computational skills for large numbers of students. These innovations in teaching not only convey important computational content, but also broaden participation beyond existing approaches to computer science, and integrate issues of human contexts and ethics throughout the full curriculum. Goals include increasing diversity among students learning computer science, giving students a strong ethical foundation within their computer science work, and encouraging critical thinking in the application of inference and statistical techniques. Bringing a laptop is recommended.

  • Preprint Article
  • 10.5194/egusphere-egu25-9878
Theory and implementation of least-squares-based deep learning
  • Mar 18, 2025
  • Alireza Amiri-Simkooei

Big data is one of the most important phenomena of the 21st century, creating unique opportunities and challenges in its processing and interpretation. Machine learning (ML), a subset of artificial intelligence (AI), has become a foundation of data science, enabling applications in computer vision, geoscience, aviation, and medicine. ML becomes important when complexity makes it impossible to establish explicit mathematical models connecting explanatory variables to predicted variables. Deep learning (DL), a subset of ML, has revolutionized fields such as speech recognition, email filtering, and time series analysis. However, DL methods face challenges such as high data demand, overfitting, and the “black box” problem.

We review least-squares-based deep learning (LSBDL), a framework that combines the interpretability of linear least squares (LS) theory with the flexibility and power of deep learning (DL). LS theory, widely used in engineering and the geosciences, provides powerful tools for parameter estimation, quality control, and reliability through linear models. DL, on the other hand, deals with modelling complex nonlinear relationships where the mapping between explanatory and predicted variables is unknown. LSBDL bridges these approaches by formulating DL within the LS framework: networks are trained to establish a design matrix, an essential element of linear models. Through this integration, LSBDL endows DL with transparency, statistical inference, and reliability. Gradient descent methods such as steepest descent and the Gauss-Newton method are used to construct an adaptive design matrix. By combining the transparency of LS theory with the data-driven adaptability of DL, LSBDL addresses challenges in fields including geoscience, aviation, and data science. This approach not only improves the interpretability of DL models, but also extends the applicability of LS theory to nonlinear and complex systems, offering new opportunities for innovation and research.

By embedding statistical foundations in the DL workflow, LSBDL offers a three-fold advantage: (i) direct computation of covariance matrices for predicted outcomes allows for quantitative assessment of model uncertainty; (ii) well-established theories of hypothesis testing and outlier detection facilitate the identification of model misspecifications and outlying data; and (iii) the covariance matrix of observations can be used to train networks with statistically correlated, inconsistent, or heterogeneous datasets. Incorporating least-squares principles increases model explainability, a critical aspect of interpretable and explainable artificial intelligence, and bridges the gap between traditional statistical methods and modern DL techniques. For example, LSBDL can incorporate prior knowledge using soft and hard physics-based constraints, a technique known as physics-informed machine learning (PIML).

The approach is illustrated through three examples: surface fitting, time series forecasting, and groundwater storage downscaling. Beyond these, LSBDL offers opportunities for applications including geoscience, inverse problems, aviation, data assimilation, sensor fusion, and time series analysis.
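
To make the LSBDL recipe concrete, here is a minimal sketch under stated assumptions: an untrained random-feature layer stands in for the network that LSBDL would actually train (by steepest descent or Gauss-Newton) to produce the design matrix, and the predictive covariance then follows from standard LS theory. All names and sizes below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy surface-fitting data: y = sin(x) + noise (one of the paper's example tasks).
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.1, size=200)

# Stand-in "network": one fixed random hidden layer whose activations form the
# design matrix A. In LSBDL proper, the weights would be trained so that A
# spans the right space for the problem.
W = rng.normal(size=(1, 30))
b = rng.normal(size=30)
A = np.tanh(x @ W + b)              # design matrix, shape (n, p)

# Output layer solved in closed form by linear least squares.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

# Classical LS machinery now applies: estimate the noise variance and the
# covariance of the predictions, Cov(y_hat) = sigma^2 * A (A^T A)^-1 A^T.
n, p = A.shape
sigma2 = np.sum((y - y_hat) ** 2) / (n - p)
cov_y_hat = sigma2 * A @ np.linalg.solve(A.T @ A, A.T)
std_y_hat = np.sqrt(np.diag(cov_y_hat))

print(f"RMSE: {np.sqrt(np.mean((y - y_hat) ** 2)):.3f}")
print(f"mean predictive std: {std_y_hat.mean():.3f}")
```

The closed-form covariance in the last step is what the abstract's uncertainty claim rests on: once the design matrix exists, any classical LS diagnostic (hypothesis tests, outlier detection) can be run on it and the residuals.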

  • Book Chapter
  • Citations: 1
  • 10.1007/978-3-031-70660-8_1
Generative Modelling of Stochastic Rotating Shallow Water Noise
  • Aug 9, 2024
  • Alexander Lobbe + 2 more

In recent work, Crisan and co-authors (Foundations of Data Science, 2023) have developed a generic methodology for calibrating the noise in fluid dynamics stochastic partial differential equations where the stochasticity was introduced to parametrize subgrid-scale processes. The stochastic parameterization of subgrid-scale processes is required in the estimation of uncertainty in weather and climate predictions, to represent systematic model errors arising from subgrid-scale fluctuations. The methodology in Crisan (Foundations of Data Science, 2023) used a principal component analysis (PCA) technique based on the ansatz that the increments of the stochastic parametrization are normally distributed. In this chapter, the PCA technique is replaced by a generative model technique. This enables us to avoid imposing additional constraints on the increments. The methodology is tested on a stochastic rotating shallow water model with the elevation variable of the model used as input data. The numerical simulations show that the noise is indeed non-Gaussian. The generative modelling technology gives good RMSE, CRPS score, and forecast rank histogram results.
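
For orientation, the sketch below reproduces the Gaussian PCA/EOF baseline that the chapter replaces, on synthetic stand-in data rather than the stochastic rotating shallow water model; the array shapes and the retained-mode count are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for model increments: snapshots of a field
# (e.g. elevation) on a small grid, flattened to vectors and spatially
# correlated via a random mixing matrix.
n_snapshots, n_grid = 500, 64
increments = rng.normal(size=(n_snapshots, n_grid)) @ rng.normal(size=(n_grid, n_grid)) * 0.01

# PCA / EOF calibration (the Gaussian baseline the chapter replaces):
mean = increments.mean(axis=0)
anomalies = increments - mean
_, s, Vt = np.linalg.svd(anomalies, full_matrices=False)
k = 10                                   # retain the leading EOFs
basis = Vt[:k]                           # calibrated noise basis
amplitudes = s[:k] / np.sqrt(n_snapshots - 1)

# Under the Gaussian ansatz, a new noise realisation is a random combination
# of the calibrated EOFs with normally distributed coefficients.
sample = mean + (rng.normal(size=k) * amplitudes) @ basis
print(sample.shape)                      # (64,): one synthetic increment
```

The generative-model variant would replace the Gaussian coefficients in the final sampling step with draws from a learned distribution, which is precisely the normality constraint the chapter removes.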

  • Conference Instance
  • Citations: 2
  • 10.1145/2064227
Proceedings of the 2011 workshop on Data infrastructurEs for supporting information retrieval evaluation
  • Oct 28, 2011

  • Research Article
  • 10.1093/eurpub/ckaf161.300
6.A. Round table: Personalized Prevention Roadmap for the future Healthcare (PROPHET): SRIA and implementation Roadmap
  • Oct 1, 2025
  • European Journal of Public Health

Advancements in sequencing and genotyping technologies and the integration of digital resources in healthcare have ushered in a new era of medicine. According to PROPHET, “personalised prevention aims to prevent onset, progression and recurrence of diseases through the adoption of targeted interventions that consider the biological information, environmental and behavioural characteristics, socio-economic and cultural context of individuals. This should be timely, effective and equitable to maintain the best possible balance in lifetime health trajectory”. By tailoring interventions based on risk profiles at population level, personalised prevention aims to delay disease onset, enhance quality of life and ultimately reduce the economic burden on healthcare systems. Although challenges exist in clinical implementation, genomics is the most advanced, providing examples of clinical utility such as genetic testing, polygenic risk scores, and pharmacogenomics. This SRIA considers the intricacies of the personalised prevention paradigm and elucidates the reasons for its crucial integration into European healthcare systems. After reviewing the latest research and incorporating various stakeholders’ perspectives, the SRIA identifies ten key challenges:
  • The broad scope of promotion and prevention
  • Continuous evidence synthesis system supporting personalised prevention
  • The PROPHET Framework implementation
  • Data collection and integration, and data infrastructure
  • Community engagement and trust
  • Health professionals and policy makers involvement
  • Regulatory aspects and synergy with the private sector
  • Access, equity and coverage
  • Ethical, legal, social issues
  • Changing behaviour
The accompanying Roadmap provides a detailed blueprint for implementing tailored preventive strategies for each individual based on the latest scientific advancements and the specific needs of each context. The Roadmap outlines key goals, priority actions, implementation timelines, expected outcomes and output indicators, responsible entities, funding sources, and synergies with other EU initiatives, providing a structured plan for integrating personalised prevention into healthcare. For the ten challenges, we have identified 56 goals and 66 actions. These actions range from creating platforms and repositories for publications in the field of personalised prevention to improve evidence and interoperability across Europe, to the dissemination of the PROPHET framework in real-world settings, to the design and implementation of educational programs for professionals and citizens, and the establishment of regulations for data sharing and standardisation of data. The actions provide a detailed outline of what needs to be accomplished, as well as the potential obstacles that may arise. Timelines and potential funding actors have been outlined. The workshop represents the first event disseminating the SRIA and Roadmap to the public health audience, in order to gather insights and to foster the awareness of decision makers.
Key messages:
  • The SRIA on personalised prevention addresses a key action of the EU Cancer Plan and outlines ten challenges, including data collection, integration, and infrastructure.
  • The Roadmap represents a blueprint for the SRIA implementation through 56 goals and 66 actions, directed to decision makers at national and EU level, funders and scientists overall.

  • Research Article
  • 10.1089/big.2023.29057.rtd
Importance of Community Engagement in Data Decision Making.
  • Apr 1, 2023
  • Big data
  • Michael Crawford + 4 more

  • Research Article
  • Citations: 3
  • 10.1108/jd-08-2021-0159
Data as assemblage
  • Mar 3, 2022
  • Journal of Documentation
  • Ceilyn Boyd

Purpose: A definition of data called data as assemblage is presented. The definition accommodates different forms and meanings of data; emphasizes data subjects and data workers; and reflects the sociotechnical aspects of data throughout its lifecycle of creation and use. A scalable assemblage model describing the anatomy and behavior of data, datasets and data infrastructures is also introduced.
Design/methodology/approach: Data as assemblage is compared to common meanings of data. The assemblage model's elements and relationships are also defined, mapped to the anatomy of a US Census dataset and used to describe the structure of research data repositories.
Findings: Replacing common data definitions with data as assemblage enriches information science and research data management (RDM) frameworks. Also, the assemblage model is shown to describe datasets and data infrastructures despite their differences in scale, composition and outward appearance.
Originality/value: Data as assemblage contributes a definition of data as mutable, portable, sociotechnical arrangements of material and symbolic components that serve as evidence. The definition is useful in information science and research data management contexts. The assemblage model contributes a scale-independent way to describe the structure and behavior of data, datasets and data infrastructures and supports analyses and comparisons involving them.
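
As one way to see why scale-independence matters, here is a speculative sketch of an assemblage as a recursive data structure. This is my own reading rendered in code, not the author's formalism, and every name in it is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    kind: str  # "material" (servers, paper forms) or "symbolic" (codes, schemas)

@dataclass
class Assemblage:
    name: str
    components: list[Component] = field(default_factory=list)
    sub_assemblages: list["Assemblage"] = field(default_factory=list)
    data_subjects: list[str] = field(default_factory=list)  # who the data is about
    data_workers: list[str] = field(default_factory=list)   # who creates and maintains it

# The same structure describes a dataset and, one level up, an infrastructure.
census_dataset = Assemblage(
    name="a US Census dataset",
    components=[Component("response records", "symbolic"),
                Component("codebook", "symbolic")],
    data_subjects=["enumerated households"],
    data_workers=["enumerators", "statisticians"],
)
census_infrastructure = Assemblage(
    name="census data infrastructure",
    components=[Component("collection and dissemination systems", "material")],
    sub_assemblages=[census_dataset],
)
print(census_infrastructure.sub_assemblages[0].name)
```

The recursion does the work here: because an Assemblage can contain assemblages, one description applies at both dataset and infrastructure scale, which is the scale-independence the abstract claims.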

  • Research Article
  • 10.1093/eurpub/ckae144.779
Benefits of cross-border access to human genomes at scale for research and healthcare
  • Oct 28, 2024
  • European Journal of Public Health
  • S Scollen

Genomics data will soon be routinely generated and integrated into national healthcare systems. To maximise the potential of genomic medicine, data should be accessible for research where possible. This is the remit of ELIXIR, an intergovernmental organisation that brings together life science resources from across Europe. Innovative solutions are needed to ensure that validated research findings for disease or preventative medicine are then integrated into healthcare. To tackle this, the 1+MG initiative, a joint initiative of 25 EU countries, the UK, and Norway, aims to enable secure access to genomics and the corresponding clinical data across Europe for better research, personalised healthcare and health policy making. In the design and scale-up phase (B1MG project), recommendations and guidelines to advance towards the deployment of personalised medicine at a European scale have been produced, adopted by 1+MG and developed into a 1+MG framework. This includes guidance on data governance, standards, quality and infrastructure, recommendations on how to approach citizen engagement, and a tool for countries to self-assess implementation into healthcare. The European Genomic Data Infrastructure (GDI) project supports the scale-up and sustainability phase of the 1+MG initiative, deploying infrastructure across 24 countries to support the overall ambition. Recommendations are being used to promote governance and technical interoperability across European initiatives including the European Health Data Space (EHDS) and the European Cancer Image Initiative (EUCAIM). The 1+MG will be established as a European Data Infrastructure Consortium in 2025 and will act as an Authorised Participant in the EHDS, providing access to a permanent high-quality federated data collection of genomic and health data that will accelerate research, innovation and policymaking, facilitating the deployment of genomic medicine across Europe.

  • Research Article
  • Citations: 6
  • 10.2139/ssrn.2376148
Small Data, Data Infrastructures and Big Data
  • Jan 8, 2014
  • SSRN Electronic Journal
  • Rob Kitchin + 1 more

  • Front Matter
  • Citations: 27
  • 10.1016/j.ijrobp.2016.03.006
Overview of the American Society for Radiation Oncology–National Institutes of Health–American Association of Physicists in Medicine Workshop 2015: Exploring Opportunities for Radiation Oncology in the Era of Big Data
  • Jun 6, 2016
  • International Journal of Radiation Oncology*Biology*Physics
  • Stanley H Benedict + 27 more

  • Research Article
  • Citations: 15
  • 10.1108/lm-02-2020-0027
Open research data in African academic and research libraries: a literature analysis
  • Jun 3, 2020
  • Library Management
  • Elisha R.T Chiware

Purpose: The paper presents a literature review on research data management services in African academic and research libraries against the backdrop of advancing open science and open research data infrastructures. It identifies areas of focus for libraries to support open research data.
Design/methodology/approach: The literature analysis and the future role of African libraries in research data management services were based on three areas: open science, research infrastructures and open data infrastructures. Focussed literature searches were conducted across several electronic databases and discovery platforms, and a qualitative content analysis approach was used to explore the themes based on a coded list.
Findings: The review reports an environment where open science in Africa is still at a developmental stage. Research infrastructures face funding and technical challenges. Data management services are in formative stages, with progress reported in a few countries where open science and research data management policies have emerged, cyber and data infrastructures are being developed, and limited data librarianship courses are being taught.
Originality/value: The role of academic and research libraries in Africa remains important in higher education and in national systems of research and innovation. Libraries should continue to align with institutional and national trends in response to the provision of data management services and as partners in the development of research infrastructures.

  • Conference Article
  • Citations: 2
  • 10.1109/icnsc.2013.6548739
The design of monitoring and data infrastructures — Applying a forward-thinking reference architecture
  • Apr 1, 2013
  • M Schroeder + 4 more

Climate change is an extraordinary challenge for the development of our socioeconomic environment. The compilation of comprehensive knowledge about our physical environment is of key importance for implementing mitigation strategies. Long-term terrestrial observatories support the systematic monitoring of environmental parameters. They are responsible for data collection, data analysis and, subsequently, for decision support. Not only the complex structure and the large volume of data streams, but also the necessary integration of existing monitoring infrastructures into such observatories, pose special technological challenges for today's scientific data and information management. Recent developments in Information and Communication Technology provide important conceptual and technological input for the proper design and implementation of the underlying monitoring and data infrastructures. To avoid constantly recurring system redevelopment for such infrastructures, a general and integrated approach based on a reference architecture concept is needed.

  • Research Article
  • Citations: 9
  • 10.1111/1752-1688.12439
Featured Collection Introduction: Open Water Data Initiative
  • Aug 1, 2016
  • JAWRA Journal of the American Water Resources Association
  • Jerad Bales

More from: Social Science Computer Review
  • New
  • Research Article
  • 10.1177/08944393251392916
Using Artificial Intelligence to Generate Visual Vignettes in Factorial Survey Experiments
  • Nov 3, 2025
  • Social Science Computer Review
  • Nicole Schwitter

  • New
  • Research Article
  • 10.1177/08944393251392913
Generative AI Usage by Individuals During the 2024 U.S. Presidential Election: Symmetrical and Asymmetrical Analysis
  • Oct 28, 2025
  • Social Science Computer Review
  • Wanli Liu + 2 more

  • Addendum
  • 10.1177/08944393251389280
Corrigendum to ‘Topic Modeling as a Tool to Analyze Child Abuse From the Corpus of English Newspapers in Pakistan’
  • Oct 27, 2025
  • Social Science Computer Review

  • Research Article
  • 10.1177/08944393251388098
Prompt Engineering for Large Language Model-Assisted Inductive Thematic Analysis
  • Oct 24, 2025
  • Social Science Computer Review
  • Muhammad Talal Khalid + 1 more

  • Research Article
  • 10.1177/08944393251390890
Take Action Now! A Longitudinal Study of Political Party Calls to Action Across Social Media Platforms
  • Oct 24, 2025
  • Social Science Computer Review
  • Anders Olof Larsson

  • Research Article
  • 10.1177/08944393251387282
Selective Exposure to News, Homogeneous Political Discussion Networks, and Affective Political Polarization: An Agent-Based Modeling of Minimal versus Strong Communication Effects
  • Oct 17, 2025
  • Social Science Computer Review
  • Homero Gil De Zúñiga + 2 more

  • Research Article
  • 10.1177/08944393251370354
Dialogues Towards Sociologies of Generative AI
  • Oct 16, 2025
  • Social Science Computer Review
  • Patrick Baert + 5 more

  • Research Article
  • 10.1177/08944393251388096
Riding the Tide: How Online Activists Leverage Repression
  • Oct 9, 2025
  • Social Science Computer Review
  • Hansol Kwak

  • Research Article
  • 10.1177/08944393251386073
Unpacking Divorce: Feature-Based Machine Learning Interpretation of Sociological Patterns
  • Oct 1, 2025
  • Social Science Computer Review
  • Hüseyin Doğan + 1 more

  • Research Article
  • 10.1177/08944393251382233
Welcome to the Brave New World: Lay Definitions of AI at Work and in Daily Life
  • Sep 25, 2025
  • Social Science Computer Review
  • Wenbo Li + 3 more
