Sequence repetitiveness quantification and de novo repeat detection by weighted k-mer coverage

DNA repeats are abundant in eukaryotic genomes and have been shown to play a vital role in genome evolution and regulation. A large number of approaches have been proposed to identify various repeats in the genome. Some de novo repeat identification tools can efficiently generate sequence repetitive scores based on k-mer counting for repeat detection. However, we noticed that these tools can still be improved in terms of repetitive score calculation, sensitivity to segmental duplications and detection specificity. Here we present a new computational approach named Repeat Locator (RepLoc), which is based on weighted k-mer coverage to quantify genome sequence repetitiveness and locate repetitive sequences. According to the repetitiveness map of the human genome generated by RepLoc, we found that there may be relationships between sequence repetitiveness and genome structures. A comprehensive benchmark shows that RepLoc is a more efficient k-mer-counting-based tool for de novo repeat detection. The RepLoc software is freely available at http://bis.zju.edu.cn/reploc.
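
The idea of scoring repetitiveness from k-mer counts can be illustrated with a minimal sketch. This is not RepLoc's actual algorithm: the abstract does not specify the weighting scheme, so the weight used here (k-mer count minus one, so that unique k-mers contribute zero) and the function names `kmer_counts` and `repetitiveness` are assumptions made purely for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repetitiveness(seq, k):
    """Per-base score: mean weight of the k-mers covering each base.

    Assumption (not from the paper): a k-mer's weight is its count
    minus one, so a k-mer that occurs only once contributes zero.
    """
    counts = kmer_counts(seq, k)
    n = len(seq)
    total = [0.0] * n   # summed weights of k-mers covering each base
    cover = [0] * n     # number of k-mers covering each base
    for i in range(n - k + 1):
        weight = counts[seq[i:i + k]] - 1
        for j in range(i, i + k):
            total[j] += weight
            cover[j] += 1
    return [t / c if c else 0.0 for t, c in zip(total, cover)]


if __name__ == "__main__":
    # "ACG" occurs twice, so bases near both ends score above zero.
    print(repetitiveness("ACGACG", 3))
```

Thresholding such a per-base score would give candidate repeat intervals; a real tool would also need to handle reverse complements, ambiguous bases, and genome-scale memory use.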

Telomere-to-telomere assembly of a complete human X chromosome

10.1101/735928 ◽

2019 ◽

Cited By ~ 43

Author(s):

Karen H. Miga ◽

Sergey Koren ◽

Arang Rhie ◽

Mitchell R. Vollger ◽

Ariel Gershman ◽

...

Keyword(s):

Human Genome ◽

X Chromosome ◽

Satellite Dna ◽

Tandem Repeats ◽

Reference Genome ◽

De Novo ◽

Hydatidiform Mole ◽

Segmental Duplications ◽

High Coverage ◽

Current Reference

After nearly two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no one chromosome has been finished end to end, and hundreds of unresolved gaps persist 1,2. The remaining gaps include ribosomal rDNA arrays, large near-identical segmental duplications, and satellite DNA arrays. These regions harbor largely unexplored variation of unknown consequence, and their absence from the current reference genome can lead to experimental artifacts and hide true variants when re-sequencing additional human genomes. Here we present a de novo human genome assembly that surpasses the continuity of GRCh38 2, along with the first gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome 3, we reconstructed the ∼2.8 megabase centromeric satellite DNA array and closed all 29 remaining gaps in the current reference, including new sequence from the human pseudoautosomal regions and cancer-testis ampliconic gene families (CT-X and GAGE). This complete chromosome X, combined with the ultra-long nanopore data, also allowed us to map methylation patterns across complex tandem repeats and satellite arrays for the first time. These results demonstrate that finishing the human genome is now within reach and will enable ongoing efforts to complete the remaining human chromosomes.

Genome, Transcriptome, and Germplasm Sequencing Uncovers Functional Variation in the Warm-Season Grain Legume Horsegram Macrotyloma uniflorum (Lam.) Verdc.

Frontiers in Plant Science ◽

10.3389/fpls.2021.758119 ◽

2021 ◽

Vol 12 ◽

Author(s):

H. B. Mahesh ◽

M. K. Prasannakumar ◽

K. G. Manasa ◽

Sampath Perumal ◽

Yogendra Khedikar ◽

...

Keyword(s):

Genome Wide Association Study ◽

De Novo ◽

Repetitive Sequences ◽

Warm Season ◽

Grain Legume ◽

Dna Repeats ◽

Functional Variation ◽

Food Ingredient ◽

Total Size ◽

A Genome

Horsegram is a grain legume with excellent nutritional and remedial properties and good climate resilience, able to adapt to harsh environmental conditions. Here, we used a combination of short- and long-read sequencing technologies to generate a genome sequence of 279.12Mb, covering 83.53% of the estimated total size of the horsegram genome, and we annotated 24,521 genes. De novo prediction of DNA repeats showed that approximately 25.04% of the horsegram genome was made up of repetitive sequences, the lowest among the legume genomes sequenced so far. The major transcription factors identified in the horsegram genome were bHLH, ERF, C2H2, WRKY, NAC, MYB, and bZIP, suggesting that horsegram is resistant to drought. Interestingly, the genome is abundant in Bowman–Birk protease inhibitors (BBIs), which can be used as a functional food ingredient. The results of maximum likelihood phylogenetic and estimated synonymous substitution analyses suggested that horsegram is closely related to the common bean and diverged approximately 10.17 million years ago. The double-digested restriction associated DNA (ddRAD) sequencing of 40 germplasms allowed us to identify 3,942 high-quality SNPs in the horsegram genome. A genome-wide association study with powdery mildew identified 10 significant associations similar to the MLO and RPW8.2 genes. The reference genome and other genomic information presented in this study will be of great value to horsegram breeding programs. In addition, keeping the increasing demand for food with nutraceutical values in view, these genomic data provide opportunities to explore the possibility of horsegram for use as a source of food and nutraceuticals.

mosaicFlye: Resolving long mosaic repeats using long error-prone reads

10.1101/2020.01.15.908285 ◽

2020 ◽

Cited By ~ 3

Author(s):

Anton Bankevich ◽

Pavel Pevzner

Keyword(s):

Human Chromosome ◽

Human Genome ◽

Genome Assembly ◽

Chromosome 6 ◽

Segmental Duplications ◽

Bacterial Genomes ◽

Long Read ◽

Human Chromosome 6 ◽

Genome Assemblies ◽

Eukaryotic Genomes

Long-read technologies revolutionized genome assembly and enabled resolution of bridged repeats (i.e., repeats that are spanned by some reads) in various genomes. However, the problem of resolving unbridged repeats (such as long segmental duplications in the human genome) remains largely unsolved, making it a major obstacle towards achieving the goal of complete genome assemblies. Moreover, the challenge of resolving unbridged repeats is not limited to eukaryotic genomes but also impairs assemblies of bacterial genomes and metagenomes. We describe the mosaicFlye algorithm for resolving complex unbridged repeats based on differences between various repeat copies and show how it improves assemblies of the human genome as well as bacterial genomes and metagenomes. In particular, we show that mosaicFlye results in a complete assembly of both arms of the human chromosome 6.
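
The distinction between bridged and unbridged repeats can be made concrete with a small sketch. It only illustrates the definition given in the abstract (a repeat is bridged if some read spans it); it is not part of mosaicFlye. The function name `is_bridged`, the half-open coordinate convention, and the `flank` parameter are assumptions introduced for the example.

```python
def is_bridged(repeat, reads, flank=1):
    """Return True if at least one read spans the whole repeat.

    repeat: (start, end) half-open interval of the repeat copy.
    reads:  iterable of (start, end) half-open read alignments.
    flank:  hypothetical minimum number of unique bases a read must
            cover on each side of the repeat to count as spanning it.
    """
    r_start, r_end = repeat
    return any(s <= r_start - flank and e >= r_end + flank
               for s, e in reads)


if __name__ == "__main__":
    repeat = (100, 200)
    print(is_bridged(repeat, [(50, 250)]))              # one read spans it
    print(is_bridged(repeat, [(50, 150), (150, 250)]))  # no read spans it
```

When no read spans a repeat, an assembler cannot tell from read overlap alone which flanking sequences belong together, which is why the abstract describes using differences between repeat copies instead.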

Faculty Opinions recommendation of Recent segmental duplications in the human genome.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1008862.140157 ◽

2002 ◽

Author(s):

Stephan Beck

Keyword(s):

Human Genome ◽

Segmental Duplications

The L1-dependant and Pol III transcribed Alu retrotransposon, from its discovery to innate immunity

Molecular Biology Reports ◽

10.1007/s11033-021-06258-4 ◽

2021 ◽

Vol 48 (3) ◽

pp. 2775-2789

Author(s):

Ludwig Stenz

Keyword(s):

Human Genome ◽

Viral Infections ◽

De Novo ◽

Current Knowledge ◽

Innate Immune ◽

De Novo Mutation ◽

Neuronal Diversity ◽

Alu Sequences ◽

Pol Iii ◽

The Brain

The 300 bp dimeric repeats digestible by AluI were discovered in 1979. Since then, Alu have been shown to be involved in the most fundamental epigenetic mechanisms, namely reprogramming, pluripotency, imprinting and mosaicism. These Alu encode a family of retrotransposons transcribed by the RNA Pol III machinery, notably when the cytosines that constitute their sequences are de-methylated. Alu then hijack the functions of ORF2, encoded by another transposon named L1, during reverse transcription and integration into new sites. That mechanism functions as a complex genetic parasite able to copy-paste Alu sequences. In doing so, Alu have modified even the size of the human genome, as well as of other primate genomes, during 65 million years of co-evolution. Actually, one germline retro-transposition still occurs every 20 births. Thus, Alu continue to modify our human genome today and have been implicated in de novo mutations causing diseases, including deletions, duplications and rearrangements. Most recently, retrotransposons were found to trigger neuronal diversity by inducing mosaicism in the brain. Finally, boosted during viral infections, Alu clearly interact with the innate immune system. The purpose of this review is to give a condensed overview of all these major findings that concern the fascinating physiology of Alu from their discovery up to the current knowledge.

Genomic Tackling of Human Satellite DNA: Breaking Barriers through Time

International Journal of Molecular Sciences ◽

10.3390/ijms22094707 ◽

2021 ◽

Vol 22 (9) ◽

pp. 4707

Author(s):

Mariana Lopes ◽

Sandra Louzada ◽

Margarida Gama-Carvalho ◽

Raquel Chaves

Keyword(s):

Human Genome ◽

Satellite Dna ◽

Repetitive Sequences ◽

Nucleotide Composition ◽

Genomic Component ◽

Genomic Studies ◽

Human Genomic ◽

Definition Of ◽

High Degree

(Peri)centromeric repetitive sequences and, more specifically, satellite DNA (satDNA) sequences, constitute a major human genomic component. SatDNA sequences can vary on a large number of features, including nucleotide composition, complexity, and abundance. Several satDNA families have been identified and characterized in the human genome through time, albeit at different speeds. Human satDNA families present a high degree of sub-variability, leading to the definition of various subfamilies with different organization and clustered localization. Evolution of satDNA analysis has enabled the progressive characterization of satDNA features. Despite recent advances in the sequencing of centromeric arrays, comprehensive genomic studies to assess their variability are still required to provide accurate and proportional representation of satDNA (peri)centromeric/acrocentric short arm sequences. Approaches combining multiple techniques have been successfully applied and seem to be the path to follow for generating integrated knowledge in the promising field of human satDNA biology.

Choice of assembly software has a critical impact on virome characterisation

10.1101/479105 ◽

2018 ◽

Author(s):

Thomas D.S. Sutton ◽

Adam G. Clooney ◽

Feargal J. Ryan ◽

R. Paul Ross ◽

Colin Hill

Keyword(s):

De Novo ◽

Vital Role ◽

Well Performance ◽

Community Members ◽

Reference Databases ◽

Downstream Analysis ◽

Metagenomic Assembly ◽

Assembly Performance ◽

Genomic Repeats

Background: The viral component of microbial communities plays a vital role in driving bacterial diversity, facilitating nutrient turnover and shaping community composition. Despite their importance, the vast majority of viral sequences are poorly annotated and share little or no homology to reference databases. As a result, investigation of the viral metagenome (virome) relies heavily on de novo assembly of short sequencing reads to recover compositional and functional information. Metagenomic assembly is particularly challenging for virome data, often resulting in fragmented assemblies and poor recovery of viral community members. Despite the essential role of assembly in virome analysis and difficulties posed by these data, current assembly comparisons have been limited to subsections of virome studies or bacterial datasets. Design: This study presents the most comprehensive virome assembly comparison to date, featuring 16 metagenomic assembly approaches which have featured in human virome studies. Assemblers were assessed using four independent virome datasets, namely simulated reads, two mock communities, viromes spiked with a known phage, and human gut viromes. Results: Assembly performance varied significantly across all test datasets, with SPAdes (meta) performing consistently well. Performance of MIRA and VICUNA varied, highlighting the importance of using a range of datasets when comparing assembly programs. It was also found that while some assemblers addressed the challenges of virome data better than others, all assemblers had limitations. Low read coverage and genomic repeats resulted in assemblies with poor genome recovery, high degrees of fragmentation and low accuracy contigs across all assemblers. These limitations must be considered when setting thresholds for downstream analysis and when drawing conclusions from virome data.

Genus-wide characterization of bumblebee genomes reveals variation associated with key ecological and behavioral traits of pollinators

10.1101/2020.05.29.122879 ◽

2020 ◽

Author(s):

Cheng Sun ◽

Jiaxing Huang ◽

Yun Wang ◽

Xiaomeng Zhao ◽

Long Su ◽

...

Keyword(s):

Social Evolution ◽

De Novo ◽

Phenotypic Diversity ◽

Incomplete Lineage Sorting ◽

Repetitive Sequences ◽

Gene Tree ◽

Pathogen Transmission ◽

Genomic Variation ◽

Social Parasite ◽

Behavioral Traits

Bumblebees are a diverse group of globally important pollinators in natural ecosystems and for agricultural food production. With both eusocial and solitary lifecycle phases, and some social parasite species, they are especially interesting models to understand social evolution, behavior, and ecology. Reports of many species in decline point to pathogen transmission, habitat loss, pesticide usage, and global climate change, as interconnected causes. These threats to bumblebee diversity make our reliance on a handful of well-studied species for agricultural pollination particularly precarious. To broadly sample bumblebee genomic and phenotypic diversity, we de novo sequenced and assembled the genomes of 17 species, representing all 15 subgenera, producing the first genus-wide quantification of genetic and genomic variation potentially underlying key ecological and behavioral traits. The species phylogeny resolves subgenera relationships while incomplete lineage sorting likely drives high levels of gene tree discordance. Five chromosome-level assemblies show a stable 18-chromosome karyotype, with major rearrangements creating 25 chromosomes in social parasites. Differential transposable element activity drives changes in genome sizes, with putative domestications of repetitive sequences influencing gene coding and regulatory potential. Dynamically evolving gene families and signatures of positive selection point to genus-wide variation in processes linked to foraging, diet and metabolism, immunity and detoxification, as well as adaptations for life at high altitudes. These high-quality genomic resources capture natural genetic and phenotypic variation across bumblebees, offering new opportunities to advance our understanding of their remarkable ecological success and to identify and manage current and future threats.

ABySS 2.0: Resource-Efficient Assembly of Large Genomes using a Bloom Filter

10.1101/068338 ◽

2016 ◽

Cited By ~ 4

Author(s):

Shaun D Jackman ◽

Benjamin P Vandervalk ◽

Hamid Mohamadi ◽

Justin Chu ◽

Sarah Yeo ◽

...

Keyword(s):

Human Genome ◽

Dna Sequences ◽

Message Passing ◽

Large Scale ◽

De Novo ◽

Bloom Filter ◽

Genomic Variation ◽

De Bruijn Graph ◽

Single Individual ◽

Probabilistic Data Structure

The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps towards elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species and between or within individuals, critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically, and coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50 bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its re-design, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We present assembly benchmarks of human Genome in a Bottle 250 bp Illumina paired-end and 6 kbp mate-pair libraries from a single individual, yielding an NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using less than 35 GB of RAM, a modest memory requirement by today’s standard that is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics’ Chromium data to further improve the scaffold contiguity of this assembly to 42 (15) Mbp.
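
How a Bloom filter can stand in for a de Bruijn graph may be easier to see in a small sketch. This is a generic illustration of the technique, not ABySS 2.0's implementation: the class `BloomFilter`, the helpers `build_filter` and `successors`, the SHA-256-based hashing, and the default sizes are all assumptions chosen for readability. The k-mers are stored only as set bits, and graph edges are recovered implicitly by querying the four possible one-base extensions of a k-mer.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash functions (illustrative)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def build_filter(seq, k, num_bits=1 << 16, num_hashes=3):
    """Insert every overlapping k-mer of seq into a new Bloom filter."""
    bf = BloomFilter(num_bits, num_hashes)
    for i in range(len(seq) - k + 1):
        bf.add(seq[i:i + k])
    return bf


def successors(bf, kmer):
    """k-mers reachable by one base in the implicit de Bruijn graph.

    The result may contain false positives from the Bloom filter.
    """
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in bf]


if __name__ == "__main__":
    bf = build_filter("ACGTAC", 3)
    print(successors(bf, "ACG"))
```

The memory saving comes from never storing the k-mers themselves; the cost is that false-positive edges can appear, so an assembler built on this idea needs some way to detect and discard them.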

Distinctive functional regime of endogenous lncRNAs in dark regions of human genome

10.1101/2020.12.06.413880 ◽

2020 ◽

Author(s):

Anyou Wang ◽

Rong Hai

Keyword(s):

Human Genome ◽

Rna Processing ◽

Self Regulation ◽

Post Translational Modification ◽

Protein Coding ◽

Noncoding Regions ◽

Coding Regions ◽

Rnaseq Data ◽

Response To Stress ◽

Eukaryotic Genomes

Eukaryotic genomes gradually gain noncoding regions as evolution advances, and the human genome actively transcribes >90% of its noncoding regions 1, suggesting their criticality in the evolution of the human genome. Yet <1% of them have been functionally characterized 2, leaving most of the human genome in the dark. Here we systematically decode endogenous lncRNAs located in unannotated regions of the human genome and decipher a distinctive functional regime of lncRNAs hidden in massive RNAseq data. LncRNAs are distributed divergently across chromosomes, independent of protein-coding regions. Their transcription rarely initiates at promoters through polymerase II, but mostly at enhancers. Yet conventional enhancer activators (e.g. H3K4me1) account for only a small proportion of lncRNA activation, suggesting that alternative, unknown mechanisms initiate the majority of lncRNAs. Meanwhile, lncRNA self-regulation also notably contributes to lncRNA activation. LncRNAs trans-regulate broad bioprocesses, including transcription and RNA processing, cell cycle, respiration, response to stress, chromatin organization, post-translational modification, and development. Overall, lncRNAs govern their own regime, distinct from that of proteins.
