A MULTI-LAYER DELTA LAKEHOUSE FOR EPIDEMIOLOGICAL MONITORING AND FORECASTING UNDER EMERGENCIES
Public health emergencies demand fast, dependable analytics that combine real-time signals with trustworthy historical data. Open, interoperable platforms that support streaming and batch workflows can shorten the time from detection to action while preserving data quality and auditability. Aim: To design and justify an information system architecture for analyzing epidemic threats under emergency conditions that is scalable, reliable, and fit for integration with clinical and non-traditional data sources. Methods: We conducted a structured review of three data analytics architectures (Lambda, Kappa, Delta) and mapped their strengths and limits to crisis surveillance needs. Based on functional and non-functional requirements, we specified a Delta Lake–based lakehouse with bronze-silver-gold tiers, unified batch/stream ingestion with Spark Structured Streaming, ACID tables with time travel and schema control, and an analytics layer that supports forecasting with MLOps for monitoring, drift checks, retraining, and lineage. Results: The proposed architecture meets core emergency needs for timeliness, integrity, and reproducibility through ACID transactions, versioned datasets, and curated tiers; supports standards-based interoperability and the inclusion of wastewater, mobility, and other environmental feeds; provides a single code path for batch and streaming to reduce reconciliation burden; and sets operational guardrails for latency versus cost when running many near-real-time tables. We outline practical considerations for quality checks in the silver tier, promotion rules to gold, and model governance. Conclusions: A Delta-based lakehouse offers a clear path to an emergency-ready surveillance platform that scales with data growth, integrates heterogeneous sources, and supports reliable forecasting. 
The next steps are a pilot deployment with public health partners, live latency and cost measurements, and prospective validation of forecasting and alerting in real-world settings.
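The tiered flow the abstract describes (raw bronze ingest, quality-gated silver, curated gold) can be sketched in a few lines of plain Python. This is a minimal stand-in for the promotion logic only — the record fields, validity rules, and aggregation are illustrative assumptions, and the actual proposal runs this pattern on Spark Structured Streaming over Delta tables.

```python
# Minimal sketch of bronze -> silver -> gold promotion for case-count data.
# Field names, quality rules, and the aggregation are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CaseRecord:
    region: str
    date: str                 # ISO-8601 day, e.g. "2024-01-01"
    new_cases: Optional[int]  # may be missing in raw (bronze) feeds

def to_silver(bronze):
    """Silver tier: keep only records passing basic validity checks."""
    silver = []
    for r in bronze:
        if r.new_cases is None or r.new_cases < 0:
            continue          # reject null or negative counts
        if not r.region or not r.date:
            continue          # reject records with incomplete keys
        silver.append(r)
    return silver

def to_gold(silver):
    """Gold tier: curated daily case totals per region."""
    totals = {}
    for r in silver:
        key = (r.region, r.date)
        totals[key] = totals.get(key, 0) + r.new_cases
    return totals
```

In the proposed architecture the same promotion logic would be written once and applied to both batch and streaming inputs, which is what removes the reconciliation burden between the two paths.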
- Research Article
- 10.1055/s-0039-1677916
- Aug 1, 2019
- Yearbook of Medical Informatics
Objectives: With the explosive growth in availability of health data captured using non-traditional sources, the goal for this work was to evaluate the current biomedical literature on theory-driven studies investigating approaches that leverage non-traditional data in personalized medicine applications. Methods: We conducted a literature assessment guided by the personalized medicine unsolicited health information (pUHI) conceptual framework, incorporating diffusion of innovations and task-technology fit theories. Results: The assessment provided an overview of the current literature and highlighted areas for future research. In particular, there is a need for: more research on the relationship between attributes of innovation and of societal structure on adoption; new study designs to enable flexible communication channels; more work to create and study approaches in healthcare settings; and more theory-driven studies with data-driven interventions. Conclusion: This work introduces to an informatics audience an elaboration on personalized medicine implementation with non-traditional data sources by blending it with the pUHI conceptual framework to help explain adoption. We highlight areas to pursue future theory-driven research on personalized medicine applications that leverage non-traditional data sources.
- Research Article
- 10.1002/fsh.10858
- Nov 1, 2022
- Fisheries
A constant challenge in fisheries stock assessment and management is having sufficient data to inform research and analyses. Nontraditional data sources like citizen science, when collected and applied appropriately, can help fill such data gaps. Use of nontraditional data sources is on the rise, but its use and application in fisheries science and management remains largely untapped. In order to examine the use of such data sources, we held a symposium at the 2020 American Fisheries Society Annual Meeting entitled “How Citizen Science and Nontraditional Data Sources can be Better Incorporated Into Fisheries Stock Assessments and Management” (https://bit.ly/3V13vHe). The session included 12 talks and a panel discussion to examine best practices. This paper reviews seven nontraditional data programs and projects used to support fisheries management featured in this special issue of Fisheries. It concludes with key lessons from the panel discussion for best applying nontraditional data sources in fisheries.
- Research Article
- 10.1016/j.amepre.2021.05.040
- Oct 19, 2021
- American Journal of Preventive Medicine
Bringing Iowa TelePrEP to Scale: A Qualitative Evaluation
- Research Article
- 10.1038/s41366-023-01331-3
- Jul 1, 2023
- International journal of obesity (2005)
The complex nature of obesity increasingly requires a comprehensive approach that includes the role of environmental factors. For understanding contextual determinants, the resources provided by technological advances could become a key factor in obesogenic environment research. This study aims to identify different sources of non-traditional data and their applications, considering the domains of obesogenic environments: physical, sociocultural, political and economic. We conducted a systematic search in PubMed, Scopus and LILACS databases by two independent groups of reviewers, from September to December 2021. We included those studies oriented to adult obesity research using non-traditional data sources, published in the last 5 years in English, Spanish or Portuguese. The overall reporting followed the PRISMA guidelines. The initial search yielded 1583 articles, 94 articles were kept for full-text screening, and 53 studies met the eligibility criteria and were included. We extracted information about countries of origin, study design, observation units, obesity-related outcomes, environment variables, and non-traditional data sources used. Our results revealed that most of the studies originated from high-income countries (86.54%) and used geospatial data within a GIS (76.67%), social networks (16.67%), and digital devices (11.66%) as data sources. Geospatial data were the most utilised data source and mainly contributed to the study of the physical domains of obesogenic environments, followed by social networks providing data to the analysis of the sociocultural domain. A gap in the literature exploring the political domain of environments was also evident. The disparities between countries are noticeable. Geospatial and social network data sources contributed to studying the physical and sociocultural environments, which could be a valuable complement to those traditionally used in obesity research. 
We propose the use of information available on the Internet, addressed by artificial intelligence-based tools, to increase the knowledge on political and economic dimensions of the obesogenic environment.
- Research Article
- 10.3389/fpubh.2024.1350743
- Mar 7, 2024
- Frontiers in Public Health
The COVID-19 pandemic prompted new interest in non-traditional data sources to inform response efforts and mitigate knowledge gaps. While non-traditional data offers some advantages over traditional data, it also raises concerns related to biases, representativity, informed consent and security vulnerabilities. This study focuses on three specific types of non-traditional data: mobility, social media, and participatory surveillance platform data. Qualitative results are presented on the successes, challenges, and recommendations of key informants who used these non-traditional data sources during the COVID-19 pandemic in Spain and Italy. Semi-structured qualitative interviews were conducted with experts in artificial intelligence, data science, epidemiology, and/or policy making who utilized non-traditional data in Spain or Italy during the pandemic. Questions focused on barriers and facilitators to data use, as well as opportunities for improving utility and uptake within public health. Interviews were transcribed, coded, and analyzed using the framework analysis method. Non-traditional data proved valuable in providing rapid results and filling data gaps, especially when traditional data faced delays. Increased data access and innovative collaborative efforts across sectors facilitated its use. Challenges included unreliable access and data quality concerns, particularly the lack of comprehensive demographic and geographic information. To further leverage non-traditional data, participants recommended prioritizing data governance, establishing data brokers, and sustaining multi-institutional collaborations. The value of non-traditional data was perceived as underutilized in public health surveillance, program evaluation and policymaking. Participants saw opportunities to integrate these data into public health systems with the necessary investments in data pipelines, infrastructure, and technical capacity.
While the utility of non-traditional data was demonstrated during the pandemic, opportunities exist to enhance its impact. Challenges reveal a need for data governance frameworks to guide practices and policies of use. Despite the perceived benefit of collaborations and improved data infrastructure, efforts are needed to strengthen and sustain them beyond the pandemic. Lessons from these findings can guide research institutions, multilateral organizations, governments, and public health authorities in optimizing the use of non-traditional data.
- Research Article
- 10.1371/currents.dis.d2800aa4e536b9d6849e966e91488003
- Jan 1, 2013
- PLoS Currents
<b>Background:</b> Hurricane Isaac made landfall in southeastern Louisiana in late August 2012, resulting in extensive storm surge and inland flooding. As the lead federal agency responsible for medical and public health response and recovery coordination, the Department of Health and Human Services (HHS) must have situational awareness to prepare for and address state and local requests for assistance following hurricanes. Both traditional and non-traditional data have been used to improve situational awareness in fields like disease surveillance and seismology. This study investigated whether non-traditional data (i.e., tweets and news reports) fill a void in traditional data reporting during hurricane response, as well as whether non-traditional data improve the timeliness for reporting identified HHS Essential Elements of Information (EEI). <b>Methods:</b> HHS EEIs provided the information collection guidance, and when the information indicated there was a potential public health threat, an event was identified and categorized within the larger scope of overall Hurricane Isaac situational awareness. Tweets, news reports, press releases, and federal situation reports during Hurricane Isaac response were analyzed for information about EEIs. Data that pertained to the same EEI were linked together and given a unique event identification number to enable more detailed analysis of source content. Reports of sixteen unique events were examined for types of data sources reporting on the event and timeliness of the reports. <b>Results:</b> Of these sixteen unique events, six were reported by only a single data source, four were reported by two data sources, four were reported by three data sources, and two were reported by four or more data sources.
For five of the events where news tweets were one of multiple sources of information about an event, the tweet occurred prior to the news report, press release, local government/emergency management tweet, and federal situation report. In all circumstances where citizens were reporting along with other sources, the citizen tweet was the earliest notification of the event. <b>Conclusion:</b> Critical information is being shared by citizens, news organizations, and local government representatives. To have situational awareness for providing timely, life-saving public health and medical response following a hurricane, this study shows that non-traditional data sources should augment traditional data sources and can fill some of the gaps in traditional reporting. During a hurricane response, where early event detection can save lives and reduce morbidity, tweets can provide a source of information for early warning. In times of limited budgets, investing technical and personnel resources to efficiently and effectively gather, curate, and analyze non-traditional data for improved situational awareness can yield a high return on investment.
- Discussion
- 10.1002/cpt.2335
- Jul 12, 2021
- Clinical Pharmacology and Therapeutics
A Sponsor’s View on Postmarketing Regulatory Commitments Involving Human Drug Products
- Abstract
- 10.1016/j.ijid.2020.09.913
- Dec 1, 2020
- International Journal of Infectious Diseases
Epicore: An infectious disease surveillance tool for field-based verification of public health events
- Book Chapter
- 10.1007/978-981-19-4460-4_10
- Jan 1, 2023
This chapter discusses aspects of data sources for budgeting and forecasting. It provides empirical evidence on the preference for data sources for a sample of experienced managers in the context of sales predictions. The authors show that managers still have strong preferences for traditional accounting data sources relative to non-traditional data sources. These preferences differ across levels of education. Furthermore, the credibility (and not the use) of social media positively influences the preference for non-traditional data sources. These findings indicate that non-traditional data sources coexist with and complement traditional accounting sources rather than substituting for them. Keywords: Big Data; Sales predictions; Forecasts; Budgeting; Data sources; Social media
- Book Chapter
- 10.1007/978-3-319-68604-2_3
- Jan 1, 2017
The estimation of disease prevalence based on public health surveillance data requires the accurate identification of cases from limited information (e.g., diagnostic codes). These data sources typically consist of routinely collected records of population healthcare utilization, such as administrative and clinical data, that specify diagnostic codes or terms for each encounter. These data sources include, for example, emergency department visits, pharmaceutical (drug) dispensations, and laboratory test orders. The case definitions depend on the data source and are typically based on the presence of diagnostic codes or key words in a prespecified time frame. Each data source will result in a certain degree of misclassification bias when estimating prevalence. Inaccuracies can occur at each stage from the time the disease process is initiated to the stage at which diagnostic codes are entered into the database. Indeed, when relying on these data sources, asymptomatic cases will be missed, as well as those not seeking health care. Even patients who seek care may be inaccurately diagnosed, or the diagnostic code entered in the system may not represent the diagnosis or may not be a code or key word used in the definition. In addition to misclassification bias, these data sources are not usually available in a timely manner. Timeliness is an important factor for prevalence estimation in certain contexts, such as the prevalence of infectious diseases during an epidemic. For instance, in an influenza pandemic, such estimates must be obtained within days. In recent years, several nonclinical and nontraditional data sources have been introduced to public health surveillance with the potential to provide more timely signals of changing prevalence trends. Ideally, by combining new and traditional data sources, there is greater potential to overcome bias and provide more timely signals.
However, building a construct capable of incorporating data from these various sources in a coherent manner is not trivial. In this research, we consider the case of the 2009–2010 H1N1 pandemic as the context of interest and we use media reports of deaths from H1N1 on the web as a nontraditional data source. We propose to use dynamic Bayesian networks from the class of probabilistic graphical models in order to combine this new data source with traditional ones through exploration of the possible probabilistic relationships between these data streams. This is an initial step toward building a framework that can potentially support aggregation of heterogeneous data for a real-time estimation of disease prevalence. Our preliminary results show that the proposed model can be used in accurate prediction of short-term future counts of the data sources. This is particularly useful in timely prediction of epidemic changes over a defined population.
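As a toy illustration of the data-fusion idea above, the sketch below conditions the next level of a traditional signal (say, clinic visits) on the current joint state of the traditional and non-traditional streams. This is a simple first-order conditional-frequency stand-in, not the dynamic Bayesian network the chapter proposes; the variable names, levels, and data are illustrative.

```python
# Toy fusion of a traditional and a non-traditional surveillance stream:
# predict the next clinic-visit level from the current joint state.
from collections import defaultdict

def fit_transitions(history):
    """history: list of (clinic_level, media_level) tuples, one per time step.
    Count how often each next clinic level follows each joint state."""
    counts = defaultdict(lambda: defaultdict(int))
    for state, nxt in zip(history, history[1:]):
        counts[state][nxt[0]] += 1  # next clinic level, conditioned on joint state
    return counts

def predict_next(counts, state):
    """Return the most frequent next clinic level seen after `state`."""
    followers = counts.get(state)
    if not followers:
        return None                 # unseen state: abstain rather than guess
    return max(followers, key=followers.get)
```

A full dynamic Bayesian network would replace the frequency table with learned conditional distributions and propagate uncertainty across time steps, but the conditioning structure — next value given the current joint state of both streams — is the same.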
- Research Article
- 10.1097/phh.0b013e31826833ad
- Nov 1, 2012
- Journal of Public Health Management and Practice
Advancing the Science of Delivery
- Research Article
- 10.1016/j.clsr.2022.105667
- Mar 20, 2022
- Computer Law & Security Review
Commentators have predicted that the insurance industry will soon benefit from technological advancements, such as developments in Artificial Intelligence (‘AI’) and Big Data. The application of AI- and Big Data-powered tools promises cost reduction, the creation of innovative products, and the potential to offer more efficient and tailored services to consumers. However, these new opportunities are mirrored by new legal and regulatory challenges. This article discusses challenges facing Australian data protection law, focusing on (potential) collection of consumers' data by insurers from non-traditional sources. In particular, we examine situations in which consumers may not be aware that the data collected could end up being used to price insurance. In our analysis, we discuss two useful examples of such non-traditional data sources: customer loyalty schemes and social media. These may give rise to several concerning data practices, including a significant increase in the collection of consumers' data by insurers. We argue that datafication of insurer processes may fuel excessive data collection in the context of insurance contracts, generating a substantial risk of harm to consumers, especially in terms of discrimination, exclusion, and unaffordability of insurance. We complement our analysis with the discussion of Australian insurance-specific provisions, asking if, and how, the harms examined could be adequately addressed.
- Preprint Article
- 10.5194/egusphere-egu25-21408
- Mar 15, 2025
The rise in climate and weather-related risks such as floods, droughts and landslides affects millions of people and their properties. Early Warning Systems (EWS), coupled with anticipatory actions, are instrumental in tackling these threats. Water, a central focus of Sustainable Development Goal (SDG) 6, is integral to climate action and influences many other SDGs, emphasizing the need for accurate water-related data. The United Nations launched the Early Warnings for All (EW4All) initiative in November 2022 to ensure global EWS coverage. The quantity and quality of hydrological data are critical for effective EWS and climate resilience. Moreover, hydrological data from different sources, especially non-traditional sources such as machine learning (ML) and artificial intelligence (AI) products, remain underutilized by National Hydrological Services (NHS) and other users. Accessing and processing hydrological data is often challenging due to its heterogeneity, necessitating significant effort to harmonize and integrate disparate sources. These barriers hinder effective water management and the timely issuing of early warnings. The WMO State of Global Water Resources report 2023¹ highlights the urgency of addressing data access and availability issues. Easy access to relevant data relies on machine-to-machine communication, which remains challenging for many agencies. To address this, the WMO Hydrological Observing System (WHOS) provides an interoperable framework for data sharing, access and visibility using relevant technologies. It provides functionalities such as data publishing, standardization, visualization and linking global data centres and research communities. By integrating data from diverse sources, including ML/AI, global datasets, satellite observations, and individual researchers, WHOS enhances data visibility, fosters co-operation, and demonstrates the value of hydrological data collection.
WHOS interfaces big data and non-traditional data sources with NHS data systems using standardization and brokering approaches and open-source tools. WHOS employs tools and standards such as OSCAR, WHOS DAB, WIS2Box, HydroServer2.0, HydroShare, WDE, WMDS, WCMP2.0, and OGC WaterML2.0. OSCAR serves as WMO's official metadata repository, enabling users to query and view observing stations. The Discovery and Access Broker (DAB) standardizes and harmonizes data, while WIS2Box simplifies data publication and download. HydroServer2.0 is an open-source data management tool accessible to all users, including LDCs and SIDS. Standards such as WCMP2.0 and OGC WaterML2.0 support unified data discovery and access. Additionally, the Topic Hierarchy for hydrology enables users to receive real-time data notifications by subscribing to a Message Queuing Protocol broker. The WHOS portal serves as a one-stop data portal connecting hydrological data from countries, regional and basin organizations, research communities and global centres (IGRAC, GRDC, etc.). Advances in AI, ML, satellite technology, and citizen science are producing vast amounts of data, and WHOS integrates these data to support researchers, modelers and practitioners in water resource management. WHOS provides interoperable data to EW4All, Water Resources Management and HydroSOS systems by bridging gaps between research and operational applications. It supports transboundary cooperation, joint data monitoring and sharing, while demonstrating the return on investment in hydrological data collection. By harmonizing and sharing hydrological data, WHOS is instrumental in mitigating hydrological hazards and fostering global collaboration.
¹ https://library.wmo.int/records/item/69033-state-of-global-water-resources-report-2023
- Research Article
- 10.1007/s13142-010-0002-2
- Dec 3, 2010
- Translational behavioral medicine
As the leading public health agency of the U.S. federal government, the Centers for Disease Control and Prevention (CDC) is committed to improving the public's health through practices that are known to make a difference. CDC has the responsibility to engage public health researchers, practitioners, and constituents in making evidence usable by public health practitioners. At the same time, identifying and understanding components of the process of translating research strategies to interventions used in practice are critical. CDC has led the way in articulating those components from a public health perspective and we will continue to work to realize the greatest public health impact from our research and programmatic initiatives. We are excited about the focus of Translational Behavioral Medicine and, in upcoming issues, look forward to sharing with readers some of the other work CDC is doing to effectively translate research into public health practice. This commitment is apparent throughout the agency and can be seen clearly in recent activities that aim to prevent and control chronic disease, such as the Communities Putting Prevention to Work (CPPW) initiative. In 2010, CDC awarded over $372 million to states, tribes, and territorial jurisdictions to create healthier communities by reducing obesity, decreasing tobacco use, or both through sustainable, proven, population-based approaches such as broad-based policy, systems, organizational and environmental changes in communities and schools (see http://www.cdc.gov/Features/ChronicPreventionGrants/). CPPW requires applicants to adopt one or more evidence-based strategies that have been specified in a prescribed menu of effective practices.
Results of these efforts should lead to measurable improvements in the public's health while informing CDC and its public health partners and stakeholders about a wide range of translation activities in action.
- Conference Article
- 10.1109/vlhcc.2016.7739701
- Sep 1, 2016
Information Visualization (InfoVis) often supports the analysis of structured data that is organized in documents with specific formats such as databases, Excel tables, or comma-separated files. Informal analyses that take place without anticipation and away from the desktop, however, might involve the use of data contained in digital artifacts that lack this structure (e.g., photographs, bitmaps, web pages). Such artifacts cannot provide immediate input for most existing visualization systems, as the data they contain does not exist as a set of variables with associated values. This research seeks to explore new opportunities in the design and implementation spaces of InfoVis authoring tools to support visualization in opportunistic scenarios. This document briefly defines the Opportunistic Visualization (OpportuVis) domain and describes iVoLVER, a research prototype that supports the construction of interactive visuals from non-traditional data sources. Future stages of this endeavor include the evaluation of iVoLVER from two perspectives: its analytical support and its usability features.