Recently, Large Language Models (LLMs) have transformed information retrieval, becoming widely adopted across various domains due to their ability to process extensive textual data and generate diverse insights. Biodiversity literature, with its broad range of topics, is no exception to this trend (Boyko et al. 2023, Castro et al. 2024). LLMs can support information extraction and synthesis, text annotation and classification, and many other natural language processing tasks.

We leverage LLMs to automate information retrieval from biodiversity publications, building upon data sourced from our previous work (Ahmed et al. 2024). In that work (Ahmed et al. 2023, Ahmed et al. 2024), we assessed the reproducibility of deep learning (DL) methods used in biodiversity research and developed a manual pipeline to extract key information on DL pipelines (dataset, source code, open-source frameworks, model architecture, hyperparameters, software and hardware specifications, randomness, result averaging, and evaluation metrics) from 61 publications (Ahmed et al. 2024). While this enabled detailed analysis, it required extensive manual effort by domain experts, limiting scalability.

To address this, we propose an automatic information extraction pipeline using LLMs with the Retrieval Augmented Generation (RAG) technique. RAG combines the retrieval of relevant document passages with the generative capabilities of LLMs to improve the quality and relevance of the extracted information. We employed an open-source LLM, the Hugging Face implementation of Mixtral 8x7B (Jiang et al. 2024), a mixture-of-experts model, in our pipeline (Fig. 1) and adapted the RAG pipeline from earlier work (Kommineni et al. 2024). The pipeline was run on a single NVIDIA A100 40GB graphics processing unit with 4-bit quantization.

To evaluate our pipeline, we compared the expert-assisted manual approach with the LLM-assisted automatic approach. We measured their consistency using inter-annotator agreement (IAA), quantified with Cohen's kappa score (Pedregosa et al. 2011), where 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. The kappa score between the human experts (annotators 1 and 2) was 0.54 (moderate agreement), while the scores comparing each human expert with the LLM were 0.16 and 0.12 (slight agreement). The difference is partly due to the human annotators having access to more information (including code, datasets, figures, tables, and supplementary materials) than the LLM, which was restricted to the publication text. Given this restriction, the results are promising and suggest that adding further modalities to the LLM input could improve them.
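As an illustration of the approach, the following is a minimal sketch of a RAG-based extraction step, assuming the Hugging Face transformers and sentence-transformers libraries. The model identifier, embedding model, retrieval strategy, and prompt are illustrative choices, not the exact configuration of our pipeline, which was adapted from Kommineni et al. (2024).

```python
# Minimal RAG sketch: retrieve relevant text chunks from a publication and
# ask a 4-bit-quantized Mixtral 8x7B model to extract a reproducibility variable.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer, util

MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # Hugging Face Mixtral 8x7B

# 4-bit quantization so the model fits on a single A100 40GB GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)

# Embedding model for the retrieval step (illustrative choice)
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k publication chunks most similar to the question."""
    chunk_emb = embedder.encode(chunks, convert_to_tensor=True)
    query_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, chunk_emb, top_k=top_k)[0]
    return [chunks[hit["corpus_id"]] for hit in hits]

def extract(question: str, chunks: list[str]) -> str:
    """Generate an answer grounded in the retrieved context."""
    context = "\n\n".join(retrieve(question, chunks))
    prompt = (
        f"[INST] Using only the context below, answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {question} [/INST]"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

# Example: chunks would come from the parsed publication text, e.g.
# extract("Which deep learning framework was used?", publication_chunks)
```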
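Agreement between the manual and LLM-assisted outputs can be quantified with scikit-learn's cohen_kappa_score. The label vectors in this sketch are hypothetical placeholders for per-publication annotation decisions, not our actual data.

```python
# Illustrative Cohen's kappa computation with scikit-learn.
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels: 1 = item reported in the publication, 0 = not reported
annotator_labels = [1, 0, 1, 1, 0, 1, 0, 1]
llm_labels = [1, 0, 0, 1, 1, 1, 0, 0]

kappa = cohen_kappa_score(annotator_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.2f}")
```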
Future work will involve several key improvements to our LLM-assisted information retrieval pipeline:
Incorporating multimodal data (e.g., figures, tables, code) as input to the LLM, alongside text, to enhance the accuracy and comprehensiveness of the information retrieved from publications.
Optimizing the retrieval component of the RAG framework with advanced techniques such as semantic search, hybrid search, or relevance feedback to improve the quality of the outputs.
Expanding the evaluation to a larger corpus of biodiversity literature to provide a more comprehensive understanding of the pipeline's capabilities and pave the way for its optimization.
Adopting a human-in-the-loop approach that checks the LLM-generated outputs against the ground-truth values from the respective publications to increase the quality of the overall pipeline.
Employing additional evaluation metrics beyond Cohen's kappa to better understand the LLM-assisted outputs.

Leveraging LLMs to automate information retrieval from biodiversity publications signifies a notable advancement in the scalable and efficient analysis of biodiversity literature. Initial results show promise, yet there is substantial potential for enhancement through the integration of multimodal data, optimized retrieval mechanisms, and comprehensive evaluation. By addressing these areas, we aim to improve the accuracy and utility of our pipeline, ultimately enabling broader and more in-depth analysis of biodiversity literature.