Abstract

Long terminal repeat (LTR) retrotransposons (LTR-RTs) are mobile elements that constitute the major fraction of most plant genomes. The identification and annotation of these elements via bioinformatics approaches represent a major challenge in the era of massive plant genome sequencing. In addition to their involvement in genome size variation, LTR-RTs are associated with the function and structure of different chromosomal regions and can alter the function of coding regions. Several sequence databases of plant LTR-RTs are available, either publicly, such as PGSB and RepetDB, or under restricted access, such as Repbase. Although these databases are useful for identifying LTR-RTs in new genomes by similarity, their elements are not fully classified to the lineage (also called family) level. Here, we present InpactorDB, a semi-curated dataset composed of 130,439 elements from 195 plant genomes (belonging to 108 plant species) classified to the lineage level. This dataset has been used to train two deep neural networks (one fully connected and one convolutional) for the rapid classification of these elements. For lineage-level classification, we obtain scores of up to 98% for the F1-score, precision, and recall.

Highlights

  • Transposable elements (TEs) have key roles in plant genomes

  • We used the F1-score, the harmonic mean of precision and sensitivity (recall) [39], as the performance metric; as features we used k-mer frequencies with 1 ≤ k ≤ 6, pre-processed by scaling and by dimensionality reduction with principal component analysis (PCA), following [39]

  • Using more than 4000 LTR retrotransposons (LTR-RTs) from four plant species that were not included in InpactorDB, we achieved an F1-score of up to 99% with the fully connected neural network (FNN) model, demonstrating good generalization performance
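
The feature extraction and the metric named in the highlights can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`kmer_vocabulary`, `kmer_counts`, `f1_score`) are hypothetical, raw overlapping occurrence counts are assumed as the "k-mer frequencies", and windows containing symbols other than A, C, G, or T are assumed to be skipped. The scaling and PCA steps are left out; they would be applied to the resulting feature vectors afterwards.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous nucleotides are counted


def kmer_vocabulary(k_min=1, k_max=6):
    """Return every DNA k-mer for k_min <= k <= k_max, in a fixed order."""
    return [
        "".join(letters)
        for k in range(k_min, k_max + 1)
        for letters in product(ALPHABET, repeat=k)
    ]


def kmer_counts(sequence, k_min=1, k_max=6):
    """Count overlapping occurrences of each k-mer in one sequence.

    Windows containing characters outside ALPHABET (e.g. N) are skipped.
    """
    seq = sequence.upper()
    counts = {kmer: 0 for kmer in kmer_vocabulary(k_min, k_max)}
    for k in range(k_min, k_max + 1):
        for start in range(len(seq) - k + 1):
            window = seq[start:start + k]
            if window in counts:
                counts[window] += 1
    return counts


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (sensitivity)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    features = kmer_counts("ACGTACGTTAGC")
    print(len(features))        # number of features per element
    print(features["ACG"])      # occurrences of the 3-mer ACG
    print(f1_score(0.97, 0.99))
```

With 1 ≤ k ≤ 6 over a four-letter alphabet, each element is described by 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 features, which is why a dimensionality reduction step such as PCA is a natural pre-processing choice before training.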


Summary

Introduction

Transposable elements (TEs) have key roles in plant genomes. They are major contributors to genome size [1,2], rearrangement events (such as fissions, fusions, and translocations) [3], chromosome organization and structure (e.g., centromeres) [4], and evolution and adaptation to the environment [5]. LTR retrotransposons (LTR-RTs) are characterized by the presence of one or several open reading frames involved in the mobility of the element, flanked by a direct tandem repeat of 100 bp to more than 5000 bp, called the LTR. These LTRs are directly involved in the transcriptional regulation of the element by the host's machinery [22,23].
