Abstract

The clustered regularly interspaced short palindromic repeats (CRISPR)/Cas-mediated genome editing system has recently been used for haploid production in plants. Haploid induction using the CRISPR/Cas system represents an attractive approach in cannabis, an economically important industrial, recreational, and medicinal plant. However, the CRISPR system requires the design of precise (on-target) single-guide RNAs (sgRNAs). It is therefore essential to predict the off-target activity of the designed sgRNAs to avoid unexpected outcomes. The current study aimed to assess the predictive ability of three machine learning (ML) algorithms (radial basis function (RBF), support vector machine (SVM), and random forest (RF)) alongside the ensemble-bagging (E-B) strategy, combining MIT and cutting frequency determination (CFD) scores to predict sgRNA off-target activity by in silico targeting of a histone H3-like centromeric protein, HTR12, in cannabis. Among the individual algorithms tested, RF exhibited the highest precision, recall, and F-measure, with values of 0.61, 0.64, and 0.62, respectively. We then used the RF algorithm as a meta-classifier for the E-B method, which increased precision and F-measure to 0.62 and 0.66, respectively. The E-B algorithm had the highest area under the precision-recall curve (AUC-PRC; 0.74) and area under the receiver operating characteristic (ROC) curve (AUC-ROC; 0.71), demonstrating the success of E-B as one of the common ensemble strategies. This study constitutes a foundational resource for utilizing ML models to predict sgRNA off-target activities in cannabis.
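The RF figures reported above can be cross-checked with a few lines of arithmetic. The sketch below assumes that "F-measure" means the balanced F1 score, that is, the harmonic mean of precision and recall; the abstract does not state which F-measure variant was used, and the function name `f_measure` is purely illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "F-measure" is F1; other weightings exist.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the RF algorithm in the abstract.
rf_f = f_measure(0.61, 0.64)
print(round(rf_f, 2))
```

Under that assumption, the reported RF precision (0.61) and recall (0.64) reproduce the reported F-measure of 0.62 after rounding.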

Highlights

  • The crop is generally divided and regulated as two main groups based on the level of produced tetrahydrocannabinol (THC), with anything below 0.3% THC considered hemp and plants that produce 0.3% THC or more classified as marijuana [2]

  • To predict single-guide RNA (sgRNA) cleavage efficiency, an initial dataset of 1900 putative off-target sequences was used, including 950 true-positive off-targets with a mismatch count of up to four, recognized by clustered regularly interspaced short palindromic repeats (CRISPR) [33]

  • MIT and cutting frequency determination (CFD) scores were used as input variables for a support vector machine (SVM)
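To illustrate the general idea of bagging over the two input scores, here is a minimal, dependency-free sketch. It is not the study's implementation: the study uses RF within its E-B method, whereas this sketch swaps in single-threshold decision stumps to stay short. The function names, the toy data, and the assumption that higher MIT and CFD scores indicate a likelier off-target site are all illustrative.

```python
import random
from collections import Counter


def fit_stump(samples):
    """Pick the (feature, threshold) split with the fewest training errors.

    Each sample is ((mit, cfd), label), with label 1 = off-target, 0 = not.
    A stump predicts 1 when the chosen score is >= threshold.
    """
    best = None
    for feature in (0, 1):  # 0 = MIT score, 1 = CFD score
        for scores, _ in samples:
            threshold = scores[feature]
            errors = sum((s[feature] >= threshold) != bool(y) for s, y in samples)
            if best is None or errors < best[0]:
                best = (errors, feature, threshold)
    _, feature, threshold = best
    return feature, threshold


def predict_stump(stump, scores):
    feature, threshold = stump
    return int(scores[feature] >= threshold)


def fit_bagging(samples, n_estimators=25, seed=0):
    """Train one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        bootstrap = [rng.choice(samples) for _ in samples]
        stumps.append(fit_stump(bootstrap))
    return stumps


def predict_bagging(stumps, scores):
    """Majority vote over all stumps (odd n_estimators avoids ties)."""
    votes = Counter(predict_stump(s, scores) for s in stumps)
    return votes.most_common(1)[0][0]


# Toy (MIT, CFD) pairs invented for illustration only; not data from the study.
toy = [
    ((0.05, 0.10), 0), ((0.10, 0.20), 0), ((0.15, 0.05), 0), ((0.20, 0.30), 0),
    ((0.60, 0.70), 1), ((0.70, 0.55), 1), ((0.80, 0.90), 1), ((0.90, 0.65), 1),
]
model = fit_bagging(toy)
print(predict_bagging(model, (0.95, 0.90)))  # high scores on both features
print(predict_bagging(model, (0.01, 0.01)))  # low scores on both features
```

Each stump sees a different bootstrap resample, so individual thresholds vary, and the majority vote averages out that variation. The same mechanism applies when the base learner is a stronger model.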

Introduction

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cannabis sativa L. has a long history of human use for various applications, including fibers, food, and medicine, as well as for its psychoactive properties [1]. The crop is generally divided and regulated as two main groups based on the level of produced tetrahydrocannabinol (THC), with anything below 0.3% THC considered hemp and plants that produce 0.3% THC or more classified as marijuana [2]. Marijuana and some hemp genotypes are dioecious crops, meaning that the male and female reproductive systems occur on separate plants [3]. Seedless and unfertilized female cannabis flowers are the most economical product [4].
