Abstract

Minimal absent words have been computed in genomes of organisms from all domains of life. Here, we aim to contribute to the catalogue of human genomic variation by investigating the variation in number and content of minimal absent words within a species, using four human genome assemblies. We compare the reference human genome GRCh37 assembly, the HuRef assembly of the genome of Craig Venter, the NA12878 assembly from cell line GM12878, and the YH assembly of the genome of a Han Chinese individual. We find the variation in number and content of minimal absent words between assemblies more significant for large and very large minimal absent words, where the biases of sequencing and assembly methodologies become more pronounced. Moreover, we find generally greater similarity between the human genome assemblies sequenced with capillary-based technologies (GRCh37 and HuRef) than between the human genome assemblies sequenced with massively parallel technologies (NA12878 and YH). Finally, as expected, we find the overall variation in number and content of minimal absent words within a species to be generally smaller than the variation between species.

Highlights

  • A minimal absent word of a sequence is a word not found in the sequence, such that removing either its leftmost or its rightmost character yields a word that is present in the sequence [1]

  • We compare two human genome assemblies sequenced with capillary-based technologies, namely, the reference human genome GRCh37 assembly and the HuRef assembly of the genome of Craig Venter, and two human genome assemblies sequenced with massively parallel technologies, namely, the NA12878 assembly from cell line GM12878 and the YH assembly of the genome of a Han Chinese individual

  • In order to enhance the differences between these non-stationary distributions, we will consider the distributions divided into four ranges of minimal absent word length |w|, namely, 10 bp ≤ |w| < 100 bp, 100 bp ≤ |w| < 1 kb, 1 kb ≤ |w| < 10 kb and 10 kb ≤ |w| < 100 kb, where unit bp stands for base pairs and unit kb stands for kilobase pairs
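
The four length ranges in the last highlight can be sketched as a simple binning step. The following is a minimal illustration, not the authors' code: the name `bin_lengths`, the constant `RANGES`, and the treatment of each range as a half-open interval (lower bound included, upper bound excluded) are assumptions made here for the example.

```python
from collections import Counter

# Four length ranges in base pairs; half-open intervals are an assumption.
RANGES = [(10, 100), (100, 1_000), (1_000, 10_000), (10_000, 100_000)]


def bin_lengths(lengths):
    """Count how many word lengths fall into each of the four ranges.

    Lengths outside all ranges (below 10 bp or at/above 100 kb) are ignored.
    """
    counts = Counter()
    for n in lengths:
        for lo, hi in RANGES:
            if lo <= n < hi:
                counts[(lo, hi)] += 1
                break
    return [counts[r] for r in RANGES]


# Hypothetical lengths, chosen only to exercise the range boundaries.
print(bin_lengths([9, 10, 99, 100, 5000, 100000]))  # [2, 1, 1, 0]
```

Comparing assemblies range by range, rather than over the whole distribution at once, keeps the rare long words from being swamped by the far more numerous short ones.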

Introduction

A minimal absent word of a sequence is a word not found in the sequence, such that removing either its leftmost or its rightmost character yields a word that is present in the sequence [1]. The set of minimal absent words of these two sequences, concatenated such that artificial words across the boundary between the two sequences are ignored, is {AAA, AAG, AAT, ACA, ACG, ACT, AGA, AGG, AGT, ATA, ATT, CAA, CAC, CAG, CCA, CCC, CCT, CGC, CGT, CTC, CTG, CTT, GAA, GAC, GAG, GCA, GCC, GCG, GGA, GGC, GGG, GTA, GTC, GTG, TAC, TAT, TCA, TCC, TCT, TGA, TGC, TGG, TGT, TTC, TTG, TTT, AGCT, CATG, CCGG, CTAG, GATC, TCGA, TTAA}, and the set of maximal exact repeats is {A, C, G, T, AT, CG, GC, TA}.
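
The definition above can be checked directly on short sequences. The sketch below is a brute-force illustration, not the method used in the paper (which must scale to whole genomes): the function name `minimal_absent_words` and the toy inputs are hypothetical. Passing several sequences as separate list items is one way to ignore artificial words across sequence boundaries, since substrings are collected per sequence and never span two of them.

```python
def minimal_absent_words(seqs, alphabet="ACGT"):
    """Return the minimal absent words of a collection of sequences.

    A word w is reported if w occurs in none of the sequences while
    both w[:-1] and w[1:] occur in some sequence (the empty word
    counts as present). Memory use is quadratic in sequence length,
    so this is only suitable for toy inputs.
    """
    # Every substring of every sequence, plus the empty word.
    present = {""}
    for s in seqs:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                present.add(s[i:j])

    maws = set()
    # Each candidate is a present word extended by one character,
    # so w[:-1] is present by construction.
    for prefix in present:
        for ch in alphabet:
            word = prefix + ch
            if word not in present and word[1:] in present:
                maws.add(word)
    return sorted(maws, key=lambda w: (len(w), w))


# Toy example over a two-letter alphabet.
print(minimal_absent_words(["AAC"], alphabet="AC"))  # ['CA', 'CC', 'AAA']
```

With the input `["AA", "C"]` the word AC is reported as absent, because the boundary between the two sequences is never read as a word.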
