A mismatch-free hybridization of oligonucleotides containing from 11 to 20 monomers to unknown DNA represents, in essence, a sequencing of a complementary target. Realizing this, we have used probability calculations and, in part, computer simulations to estimate the types and numbers of oligonucleotides that would have to be synthesized in order to sequence a megabase-plus segment of DNA. We estimate that 95,000 specific mixes of 11-mers, mainly of the 5′(A,T,C,G)(A,T,C,G)N8(A,T,C,G)3′ type, hybridized consecutively to dot blots of cloned genomic DNA fragments would provide primary data for the sequence assembly. An optimal mixture of representative libraries in M13 vector, having inserts of (i) 7 kb, (ii) 0.5-kb genomic fragments randomly ligated in up to 10-kb inserts, and (iii) tandem “jumping” fragments 100 kb apart in the genome, will be needed. To sequence each million base pairs of DNA, one would need hybridization data from about 2100 separate hybridization sample dots. Inevitable gaps and uncertainties in alignment of sequenced fragments, arising from nonrandom or repetitive sequence organization of complex genomes and from difficulties in cloning “poisonous” sequences in Escherichia coli, are inherent to large-scale sequencing by any method; they have been considered and minimized by the choice of libraries and the number of subclones used for hybridization. Because it is based on simpler biochemical procedures, our method is inherently easier to automate than existing sequencing methods. The sequence can be derived from the simple primary data only by extensive computing. Phased experimental tests and computer simulations of increasing complexity are needed before accurate estimates can be made of the cost and speed of sequencing by the new approach. Nevertheless, sequencing by hybridization should show advantages over existing methods because of the inherent redundancy and parallelism in its data gathering.
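The idea that a set of mismatch-free hybridization results amounts to a sequence can be illustrated with a toy sketch. It is not the assembly procedure of this work: the function names `spectrum` and `reconstruct` are hypothetical, and the sketch assumes error-free data, a known starting oligonucleotide, and a target in which every k-mer occurs at most once. It also shows how a repeat creates the kind of alignment ambiguity mentioned above.

```python
def spectrum(seq, k):
    """Set of all k-mers in seq, as idealized mismatch-free hybridization would report."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reconstruct(kmers, start):
    """Extend from `start` by unique (k-1)-base overlaps.

    Returns the assembled sequence, or None when an extension step is
    ambiguous (branch point, e.g. a repeat) or impossible (dead end).
    """
    k = len(start)
    remaining = set(kmers) - {start}
    seq = start
    while remaining:
        suffix = seq[-(k - 1):]
        candidates = [m for m in remaining if m.startswith(suffix)]
        if len(candidates) != 1:
            return None  # zero or several possible extensions
        seq += candidates[0][-1]
        remaining.remove(candidates[0])
    return seq


# Toy target without repeated 3-base overlaps: assembly is unambiguous.
target = "ATGCGTACCTGA"
assembled = reconstruct(spectrum(target, 4), target[:4])

# Toy target with a repeat ("ATG" occurs twice): the 3-mer data alone
# cannot decide how to extend past "TG", so the sketch gives up.
repeat_result = reconstruct(spectrum("ATGATGC", 3), "ATG")
```

In this sketch `assembled` equals `target`, while `repeat_result` is `None`; real data would additionally require handling of hybridization errors and of overlaps between subclones.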