Abstract

Understanding “how to optimize the production of scientific knowledge” is paramount to those who support scientific research—funders as well as research institutions—to the communities served, and to researchers. Structured archives can help all involved to learn what decisions and processes help or hinder the production of new knowledge. Using artificial intelligence (AI) and large language models (LLMs), we recently created the first structured digital representation of the historic archives of the National Human Genome Research Institute (NHGRI), part of the National Institutes of Health. This work yielded a digital knowledge base of entities, topics, and documents that can be used to probe the inner workings of the Human Genome Project, a massive international public-private effort to sequence the human genome, and several of its offshoots like The Cancer Genome Atlas (TCGA) and the Encyclopedia of DNA Elements (ENCODE). The resulting knowledge base will be instrumental in understanding not only how the Human Genome Project and genomics research developed collaboratively, but also how scientific goals come to be formulated and evolve. Given the diverse and rich data used in this project, we evaluated the ethical implications of employing AI and LLMs to process and analyze this valuable archive. As the first computational investigation of the internal archives of a massive collaborative project with multiple funders and institutions, this study will inform future efforts to conduct similar investigations while also considering and minimizing ethical challenges. 
Our methodology and risk-mitigating measures could also inform future initiatives in developing standards for project planning, policymaking, enhancing transparency, and ensuring ethical utilization of artificial intelligence technologies and large language models in archive exploration.

Author Contributions: Mohammad Hosseini: Investigation; Project administration; Writing – original draft; Writing – review & editing. Spencer Hong: Conceptualization; Data curation; Investigation; Methodology; Software; Visualization; Writing – original draft; Writing – review & editing. Thomas Stoeger: Conceptualization; Investigation; Project administration; Supervision; Writing – original draft; Writing – review & editing. Kristi Holmes: Funding acquisition; Supervision; Writing – review & editing. Luis A. Nunes Amaral: Funding acquisition; Supervision; Writing – review & editing. Christopher Donohue: Conceptualization; Project administration; Resources; Supervision; Writing – original draft; Writing – review & editing. Kris Wetterstrand: Conceptualization; Funding acquisition; Project administration.

