Abstract

Big data approaches have greatly improved scientific decision-making, but their dependence on data availability impedes their use in data-poor scenarios. Beyond increasing data abundance, enhancing data diversity is another way to access knowledge. Herein, we propose a data-driven method for selecting toxicity endpoints when directly relevant data are scarce, using shale gas exploitation sites as an example scenario. Starting from the 1173 substances in the U.S. Environmental Protection Agency’s HFList, we inferred the endpoints of greatest concern in zebrafish embryo toxicity tests (FET) using a newly developed relational database (RDB) strategy that integrates chemical, high-throughput screening (HTS) bioactivity, genome, and FET endpoint information. Built on text mining and data fusion, the RDB strategy integrated 255 bioactive contaminants, 955 HTS bioassays with known modes of action (MoAs), 214 gene ontologies, 65 pathways, and 27 phenotypic data items, and predicted measurement endpoints within 10 MoAs for shale gas pollution. The approach was then validated using zebrafish FET and transcriptomic sequencing of field-collected samples, achieving 89% and 97% accuracy for the predicted ontologies and pathways, respectively. These results highlight the applicability of RDB-based data-driven strategies for predicting toxicity endpoints from a priori knowledge of contaminants by improving data diversity.
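
As an illustration of the kind of linkage an RDB strategy relies on, the following minimal sketch chains chemicals to HTS bioassays, bioassays to genes, and genes to phenotypic endpoints, then ranks endpoints by the number of distinct bioactive chemicals that reach them. It is not the schema or scoring used in this work: the table layout, the ranking rule, and every name (`chem_A`, `assay_X`, `gene_p`, `endpoint_1`, and so on) are hypothetical placeholders, and the records are toy data.

```python
import sqlite3

# Hypothetical schema: one table per data type, plus link tables between them.
SCHEMA = """
CREATE TABLE chemical (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE assay (id INTEGER PRIMARY KEY, name TEXT, moa TEXT);
CREATE TABLE activity (chemical_id INTEGER, assay_id INTEGER);
CREATE TABLE assay_gene (assay_id INTEGER, gene TEXT);
CREATE TABLE gene_endpoint (gene TEXT, endpoint TEXT);
"""


def build_demo_db():
    """Create an in-memory database filled with toy placeholder records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO chemical VALUES (?, ?)",
        [(1, "chem_A"), (2, "chem_B"), (3, "chem_C")],
    )
    conn.executemany(
        "INSERT INTO assay VALUES (?, ?, ?)",
        [(10, "assay_X", "moa_1"), (11, "assay_Y", "moa_2")],
    )
    # Each row records that a chemical was active in an assay.
    conn.executemany(
        "INSERT INTO activity VALUES (?, ?)",
        [(1, 10), (2, 10), (2, 11), (3, 11)],
    )
    conn.executemany(
        "INSERT INTO assay_gene VALUES (?, ?)",
        [(10, "gene_p"), (11, "gene_q")],
    )
    conn.executemany(
        "INSERT INTO gene_endpoint VALUES (?, ?)",
        [
            ("gene_p", "endpoint_1"),
            ("gene_q", "endpoint_1"),
            ("gene_q", "endpoint_2"),
        ],
    )
    return conn


def rank_endpoints(conn):
    """Rank endpoints by how many distinct active chemicals link to them."""
    query = """
    SELECT ge.endpoint, COUNT(DISTINCT c.id) AS n_chemicals
    FROM chemical AS c
    JOIN activity AS act ON act.chemical_id = c.id
    JOIN assay AS a ON a.id = act.assay_id
    JOIN assay_gene AS ag ON ag.assay_id = a.id
    JOIN gene_endpoint AS ge ON ge.gene = ag.gene
    GROUP BY ge.endpoint
    ORDER BY n_chemicals DESC, ge.endpoint ASC
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for endpoint, n_chemicals in rank_endpoints(build_demo_db()):
        print(endpoint, n_chemicals)
```

Counting distinct chemicals per endpoint is only one possible prioritization rule, chosen here for brevity; a real pipeline could instead weight links by assay potency or by MoA.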
