Abstract
Background: The continuing increase in the size and quality of raw “short read” data significantly improves the quality of the assemblies obtained with various bioinformatics tools. However, building a reference genome sequence for most plant species remains a significant challenge because of the large number of repeated sequences, which are problematic for a high-quality whole-genome de novo assembly. Furthermore, most SNP identification approaches in plant genetics and breeding consider only the “Gene-space” regions, which include the promoter, exon and intron sequences.

Results: We developed the iPea protocol to produce a de novo Gene-space assembly by iteratively reconstructing the non-coding sequence flanking the Unigene cDNA sequence through the addition of next-generation DNA-seq data. The approach was developed on the large diploid genome of pea (Pisum sativum L.), which is rich in repetitive sequences. The final Gene-space assembly included 35,400 contigs (97 Mb), covering 88% of the 40,227 contigs (53.1 Mb) of the PsCam_low-copy Unigene set. Its accuracy was validated by the results of the GenoPea 13.2 K SNP Array built from it.

Conclusion: The iPEA protocol allows the reconstruction of a Gene-space from RNA-seq and DNA-seq data with limited computing resources.

Electronic supplementary material: The online version of this article (doi:10.1186/s13104-016-1903-z) contains supplementary material, which is available to authorized users.
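The iterative reconstruction described in the Results can be illustrated with a toy seed-and-extend sketch. This is not the authors' implementation: it assumes error-free reads on a single strand and exact suffix/prefix overlaps, whereas a real pipeline would rely on read mapping and an assembler. The function names `extend_once` and `iterative_extend`, and the parameters `min_overlap` and `max_iters`, are hypothetical and chosen only for illustration.

```python
def extend_once(contig, reads, min_overlap):
    """Extend the contig by at most one read on each side (toy model)."""
    best_left, best_right = "", ""
    for read in reads:
        longest = min(len(contig), len(read) - 1)
        # Right side: a suffix of the contig equals a prefix of the read.
        for k in range(longest, min_overlap - 1, -1):
            if contig.endswith(read[:k]):
                if len(read) - k > len(best_right):
                    best_right = read[k:]
                break
        # Left side: a prefix of the contig equals a suffix of the read.
        for k in range(longest, min_overlap - 1, -1):
            if contig.startswith(read[-k:]):
                if len(read) - k > len(best_left):
                    best_left = read[:-k]
                break
    return best_left + contig + best_right


def iterative_extend(seed, reads, min_overlap=6, max_iters=10):
    """Grow a seed (standing in for a Unigene sequence) until no read extends it.

    max_iters is an assumed safety bound: in repeat-rich sequence, naive
    extension could otherwise keep growing through repeated regions.
    """
    contig = seed
    for _ in range(max_iters):
        extended = extend_once(contig, reads, min_overlap)
        if extended == contig:
            break
        contig = extended
    return contig
```

Each round only uses reads that overlap the current contig ends, so the flanking sequence grows outward from the seed step by step. The iteration cap reflects why repeats make this kind of extension delicate in a genome such as pea.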
Highlights
The continuing increase in size and quality of the “short reads” raw data is a significant help for the quality of the assembly obtained through various bioinformatics tools
For single nucleotide polymorphism (SNP) identification, the Unigene set is still limited by the fact that much of the variability between genotypes of crop species is found in the noncoding portion of the gene, which is less subject to selection pressure
We present here a bioinformatics approach which allows, for a diploid species without a complete genome reference sequence, the de novo assembly of a Gene-space combining DNA-seq data from high throughput sequencing and a Unigene sequence set built from RNA-seq data
Summary
The continuing increase in the size and quality of raw “short read” data significantly improves the quality of the assemblies obtained with various bioinformatics tools. Building a reference genome sequence for most plant species remains a significant challenge because of the large number of repeated sequences, which are problematic for a high-quality whole-genome de novo assembly. Next-generation sequencing (NGS) technologies and their low cost provide easy access to the sequences of many genotypes and to their single nucleotide polymorphisms (SNPs). This ability has changed many applications of plant and animal genetics: analysis of genetic ... For SNP identification, the Unigene set is still limited by the fact that much of the variability between genotypes of crop species is found in the noncoding portion of the gene (intron sequences, 3′ and 5′ UTR), which is less subject to selection pressure.