ABSTRACT

Objective
Data science and machine learning methodologies are essential to address complex scientific challenges across various domains. These advancements generate numerous research assets such as datasets, software tools, and workflows, which are shared within the open science community. Concurrently, computational notebook environments like Jupyter Notebook, along with platforms like Google Colab and Kaggle Kernel, facilitate data science research and machine learning workflows, transforming data analysis, model development, and knowledge sharing processes. The proliferation of computational notebooks has further enriched the pool of valuable research assets. Researchers frequently require efficient access to these assets to advance their work, yet current tools often require navigating multiple websites and portals, leading to inefficiency and information overload. The challenge is compounded when relying on general web search engines that might not adequately highlight niche scientific resources.

Methods
To address these issues, we propose the development of an innovative Multiple Research Asset Search (MRAS) system designed to index diverse research assets from heterogeneous sources, offering a unified search interface for researchers. Our system aims to significantly improve the discovery of computational notebooks and datasets, facilitating data‐driven research.

Results
We developed a pipeline for data extraction and indexing, reviewed and applied state‐of‐the‐art ranking algorithms, enhanced indexing documents with content analysis, and created a Jupyter extension for asset discovery within the working environment.

Conclusion
This work is structured to detail our approach, literature review, system development, empirical validation, results, and conclusions, illustrating the potential impact of our MRAS system on scientific research efficiency.