The METAPLANTCODE project is dedicated to advancing and optimizing pan-European case studies on metabarcoding. The project's objectives include providing best practice recommendations, optimizing analysis pipelines for species identification, and creating user-friendly reference databases. To accomplish these objectives, METAPLANTCODE will identify and address gaps in current methodologies, publish best practice documents on FAIR (Findable, Accessible, Interoperable, Reusable) data publishing for plant metabarcode data to GBIF (Global Biodiversity Information Facility) and the INSDC (International Nucleotide Sequence Database Collaboration), and implement ELIXIR-compatible multimodal deep learning (DL) models in novel tools for standalone metabarcoding analyses using various data sources. A significant focus of the project is enhancing species identification accuracy through GBIF records and metadata. This involves mapping regional, national, and international botanical taxonomic checklists, red lists, and floras to the Catalogue of Life (COL) via the COL ChecklistBank. Additionally, taxonomic and floristic literature will be semantically enriched with new entity recognition and relationship extraction modules, supporting the enhanced identification of species through domain-specific descriptive and phenotypic features. An interface will link taxonomic names to treatments, identify homonyms and synonyms, and facilitate the conversion and annotation of floras, red lists, and ecological treatments. All METAPLANTCODE products will adhere to FAIR standards by the project's end. The project emphasizes knowledge transfer from the outset, engaging with associated partners and stakeholders. Key stakeholders will be identified, priorities set, and communication channels established, monitored, and adjusted as necessary. 
Efforts to enhance stakeholder engagement, training, and outreach will ensure that plant metabarcoding becomes a routine standard for biodiversity monitoring in Europe and beyond.

Deep Learning for Plant Metabarcoding

Within the METAPLANTCODE project, our team is tasked with improving taxonomic precision by integrating deep learning on metabarcoding data and metadata. Previous studies have demonstrated the applicability of deep learning to non-plant barcoding data and its computational efficiency compared to traditional bioinformatics approaches (Flück et al. 2022).

Deep Learning Models for Metabarcoding Data

Our approach involves evaluating the efficacy of several deep learning models, such as Convolutional Neural Networks (CNN) (LeCun et al. 2015), Transformer models (Vaswani et al. 2017), Hyena (Poli et al. 2023), and Mamba architectures (Gu et al. 2023), on plant barcoding datasets. Preliminary results will be presented, highlighting the application of these models and the proposed ensemble method (Mohammed and Kora 2023), which combines multiple barcode sequence representations and learning strategies. The ensemble approach, when integrated with classical machine learning models such as logistic regression and Support Vector Machines (SVM) (Noble 2006), is anticipated to offer improved precision and robustness compared to individual model applications (Fig. 1).

Multimodal Refinement of Predictions

In the subsequent phase, we aim to refine genetic sequence classifications by employing a multimodal strategy. This approach will integrate genetic information with traditional botanical knowledge. We will utilize biological interaction lists (e.g., species-species, species-habitat) provided by the METAPLANTCODE project to train a large language model (LLM) on relevant scientific literature. This LLM, specifically tailored for plant biodiversity, will incorporate metadata associated with genetic samples (including location, temporality, and climatic conditions).
By merging embeddings of both metadata and genetic data, we aim to enhance the accuracy of taxonomic predictions (Fig. 2).

Conclusion

Through this research, we aim to develop an effective method for integrating genetic data with textual information from various sources. We anticipate that this approach will not only enhance plant metabarcoding but also be applicable to barcoding of other groups, such as bacteria, fish, and fungi. Additionally, we expect this methodology to find broader applications in genomic research, providing valuable insights and improvements across diverse biological disciplines.
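
As an illustration of the embedding merge described under Multimodal Refinement of Predictions, the following is a minimal sketch and not the project's implementation. The abstract does not state how the two embeddings are combined, so simple concatenation is assumed here. Normalised k-mer frequencies stand in for a learned sequence embedding, a small hand-built feature vector stands in for the LLM-derived metadata embedding, and all function names, taxon labels, sequences, and coordinates are hypothetical.

```python
from itertools import product
import math

K = 3  # k-mer length; an arbitrary choice for this sketch
KMER_INDEX = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}


def sequence_embedding(seq):
    """Normalised k-mer frequencies (stand-in for a learned sequence embedding)."""
    vec = [0.0] * len(KMER_INDEX)
    seq = seq.upper()
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:  # skip k-mers containing ambiguous bases such as N
            vec[idx] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec


def metadata_embedding(lat, lon, month):
    """Scaled coordinates plus a cyclic month encoding (hypothetical features)."""
    angle = 2 * math.pi * (month - 1) / 12
    return [lat / 90.0, lon / 180.0, math.sin(angle), math.cos(angle)]


def fused_embedding(seq, lat, lon, month, metadata_weight=1.0):
    """Assumed fusion rule: concatenate the sequence and metadata vectors."""
    meta = [metadata_weight * m for m in metadata_embedding(lat, lon, month)]
    return sequence_embedding(seq) + meta


def nearest_label(references, query):
    """Return the label whose reference vector is closest to the query."""
    return min(references, key=lambda label: math.dist(references[label], query))


# Two hypothetical taxa with identical barcodes but different sampling sites:
# the sequence part cannot separate them, so the metadata part decides.
references = {
    "taxon_A": fused_embedding("ACGTACGTAC", 60.0, 10.0, 6),
    "taxon_B": fused_embedding("ACGTACGTAC", 40.0, -3.0, 6),
}
query = fused_embedding("ACGTACGTAC", 58.0, 11.0, 7)
print(nearest_label(references, query))
```

In this toy case the barcode alone is ambiguous, and the sampling location resolves the query to the geographically closer reference. The `metadata_weight` parameter is one simple way to tune how strongly metadata influences the decision relative to the sequence.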