304. Deriving Useful Data from Next-Gen Sequencing of AAV Capsid Libraries

Damien Marsic, Sergei Zolotukhin

doi:10.1016/s1525-0016(16)33913-2

Abstract

The relatively short read lengths produced by the major Next-Gen sequencing platforms (up to 300 nt for Illumina, up to 200 nt for Ion Proton) are poorly suited for analyzing large combinatorial libraries of longer nucleotide sequences, such as those involving AAV capsid genes. Despite its much lower throughput and accuracy, the PacBio single-molecule real-time technology is currently the only option for long templates, with average read lengths of 10 to 15 kb. The Circular Consensus Sequencing (CCS) mode, in which template DNA fragments are circularized, significantly increases accuracy because each template is sequenced multiple times. To interpret PacBio CCS data, we previously reported a first version of the CapLib code, which was developed to identify variable regions in AAV combinatorial capsid libraries.

DNA fragments derived from purified DNA-containing AAV particles, 869 bp in length and including 27 variable nucleotide positions, were sequenced in CCS mode using the P6-C4 chemistry. A total of 26,897 reads were obtained, with a mean read length of 814 nt, a mean read quality of 0.9956 and a mean number of passes of 21.34. Only 5,456 reads had the correct size of 869 nt, and of these, only 1,638 had a sequence that matched the reference sequence, indicating that only 6% of reads were potentially error-free and that the vast majority had multiple insertions and deletions.

To extract more useful information from the sequencing data, a new version of the CapLib software was developed. It is designed to correct sequencing reads in silico by assuming that constant nucleotide positions are wild-type and focusing on the detection of the variable positions. The premise was validated by Sanger sequencing of multiple clones, confirming that mutations were present only in the intended positions.
Depending on the parameter values used, up to 14,000 reads could be recovered by CapLib 2.

In addition to recovering PacBio CCS reads, CapLib 2 can also assemble Sanger sequencing data, translate recovered reads into protein sequences and perform detailed analyses of the dataset. It can also analyze clones resulting from directed evolution experiments and compare them with the original library.
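The correction premise stated above (constant positions are assumed wild-type, and only the variable positions are read from the data) can be illustrated with a minimal Python sketch. This is not the CapLib 2 implementation, which the abstract does not describe. The flank-anchoring strategy, the function names `variable_blocks` and `correct_read`, the `flank` parameter, the use of `N` to mark variable positions, and the toy sequences are all assumptions made for illustration.

```python
import re

def variable_blocks(reference):
    """Return (start, end) spans of consecutive 'N' (variable) positions."""
    return [m.span() for m in re.finditer(r"N+", reference)]

def _as_pattern(seq):
    """Turn a reference fragment into a regex; 'N' matches any base."""
    return "".join("[ACGT]" if base == "N" else base for base in seq)

def correct_read(read, reference, flank=6):
    """Rebuild a read from the reference, keeping only its variable bases.

    Hypothetical approach: each variable block is located in the read by
    exact matches to its constant flanks, so indels elsewhere in the
    constant regions are ignored. Returns None if any block cannot be
    found exactly once with its expected length.
    """
    corrected = list(reference)
    for start, end in variable_blocks(reference):
        left = _as_pattern(reference[max(0, start - flank):start])
        right = _as_pattern(reference[end:end + flank])
        pattern = left + "([ACGT]{%d})" % (end - start) + right
        hits = re.findall(pattern, read)
        if len(hits) != 1:
            return None  # flank damaged, block length changed, or ambiguous
        corrected[start:end] = hits[0]
    return "".join(corrected)

# Toy reference: two variable blocks (3 nt and 2 nt) in a constant backbone.
reference = "ACGTTGCA" + "NNN" + "GATTACAGGCT" + "NN" + "CCATGGTA"
# Read with a one-base deletion in the constant region (the 'C' at index 1).
read = "AGTTGCA" + "TGC" + "GATTACAGGCT" + "AA" + "CCATGGTA"
print(correct_read(read, reference))
```

In this sketch, a read with an indel inside a variable block or next to it is rejected, not repaired. A stricter or looser `flank` value changes how many reads are recovered, which is one plausible way that parameter values could affect yield, although the abstract does not say which parameters CapLib 2 exposes.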
