From telomere to telomere: the transcriptional and epigenetic state of human repeat elements

2021 ◽  
Author(s):  
Savannah J Hoyt ◽  
Jessica M Storer ◽  
Gabrielle A Hartley ◽  
Patrick G.S. Grady ◽  
Ariel Gershman ◽  
...  

Mobile elements and highly repetitive genomic regions are potent sources of lineage-specific genomic innovation and fingerprint individual genomes. Comprehensive analyses of large, composite or arrayed repeat elements and those found in more complex regions of the genome require a complete, linear genome assembly. Here we present the first de novo repeat discovery and annotation of a complete human reference genome, T2T-CHM13v1.0. We identified novel satellite arrays, expanded the catalog of variants and families for known repeats and mobile elements, characterized new classes of complex, composite repeats, and provided comprehensive annotations of retroelement transduction events. Utilizing PRO-seq to detect nascent transcription and nanopore sequencing to delineate CpG methylation profiles, we defined the structure of transcriptionally active retroelements in humans, including for the first time those found in centromeres. Together, these data provide expanded insight into the diversity, distribution and evolution of repetitive regions that have shaped the human genome.

BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Ran Li ◽  
Xiaomeng Tian ◽  
Peng Yang ◽  
Yingzhi Fan ◽  
Ming Li ◽  
...  

Background: Non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structural variation with NRS exist. Results: Here, we compared 31 human de novo assemblies with the current reference genome to identify the NRS and their locations. We resolved the precise location of 6113 NRS adding up to 12.8 Mb. Besides 1571 insertions, we detected 3041 alternate alleles, which were defined as having less than 90% (or no) identity with the reference alleles. These alternate alleles overlapped with 1143 protein-coding genes, including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had a high content of tandem repeats, indicating that their origin was associated with tandem repeats. Conclusions: Our study detected a large number of NRS, including many previously uncharacterized alternate alleles. We suggest that the origin of alternate alleles is associated with tandem repeats. Our results enrich the spectrum of genetic variation in the human genome.


2020 ◽  
Vol 10 (8) ◽  
pp. 2801-2809 ◽  
Author(s):  
Tingting Zhao ◽  
Zhongqu Duan ◽  
Georgi Z. Genchev ◽  
Hui Lu

Despite continuous updates of the human reference genome, there are still hundreds of unresolved gaps, which account for about 5% of the total sequence length. Given the availability of whole-genome de novo assemblies, especially those derived from long-read sequencing data, gap-closing sequences can be determined. By comparing 17 de novo long-read sequencing assemblies with the human reference genome, we identified a total of 1,125 gap-closing sequences for 132 (16.9% of 783) gaps, adding up to 2.2 Mb of novel sequence to the human reference genome. More than 90% of the non-redundant sequences could be verified by unmapped reads from the Simons Genome Diversity Project dataset. In addition, 15.6% of the non-reference sequences were found in at least one of four non-human primate genomes. We further demonstrated that the non-redundant sequences had a high content of simple repeats and satellite sequences. Moreover, 43 (32.6%) of the 132 closed gaps were shown to be polymorphic; such sequences may play an important biological role and can be useful in the investigation of human genetic diversity.


F1000Research ◽  
2017 ◽  
Vol 6 ◽  
pp. 1750 ◽  
Author(s):  
Tapan Kumar Mondal ◽  
Hukam Chand Rawal ◽  
Kishor Gaikwad ◽  
Tilak Raj Sharma ◽  
Nagendra Kumar Singh

Oryza coarctata plants, collected from the Sundarban delta of West Bengal, India, were used in the present study to generate draft genome sequences through a hybrid genome assembly of Illumina reads and third-generation Oxford Nanopore sequencing reads. We report, for the first time, a draft assembly with more than 85.71% genome coverage; the data have been deposited in NCBI SRA under BioProject ID PRJNA396417.


Genes ◽  
2020 ◽  
Vol 11 (11) ◽  
pp. 1350
Author(s):  
Jina Kim ◽  
Joohon Sung ◽  
Kyudong Han ◽  
Wooseok Lee ◽  
Seyoung Mun ◽  
...  

The current human reference genome (GRCh38), with its superior quality, has contributed significantly to genome analysis. However, GRCh38 may still underrepresent ethnic genomes, specifically those of Asians, though exactly what is missing remains elusive. Here, we juxtaposed GRCh38 with a high-contiguity genome assembly of one Korean individual (AK1) to show that a part of the AK1 genome is missing in GRCh38 and that the missing regions harbor ~1390 putative coding elements. Furthermore, when we analyzed the reads "unmapped" to GRCh38 from fourteen individuals (five East Asians, four Europeans, and five Africans), we found that multiple populations shared certain parts of the missing genome, amounting to ~5.3 Mb (~0.2% of AK1) of the total genomic regions. The AK1 regions recovered from the "unmapped reads", which were the estimated missing regions that did not exist in GRCh38, harbored candidate coding elements. We verified that most of the common missing regions (shared by ≥7 individuals) exist in human and chimpanzee DNA. We further identified the occurrence mechanism, ethnic heterogeneity, and presence of the common missing regions. This study illuminates a potential advantage of using a pangenome reference and highlights the need for further investigation of the various features of regions globally missed in GRCh38.


2020 ◽  
Vol 36 (9) ◽  
pp. 2899-2901
Author(s):  
Quanhu Sheng ◽  
Hui Yu ◽  
Olufunmilola Oyebamiji ◽  
Jiandong Wang ◽  
Danqian Chen ◽  
...  

Motivation: Genome annotation is an important step for all in-depth bioinformatics analysis. It is imperative to augment the quantity and diversity of genome-wide annotation data for the latest reference genome to promote its adoption by ongoing and future impactful studies. Results: We developed a Python toolkit, AnnoGen, which for the first time allows the annotation of three pragmatic genomic features for the GRCh38 genome in enormous base-wise quantities. The three features are chemical binding Energy, sequence information Entropy and Homology Score. The Homology Score is an exceptional feature that captures genome-wide homology through single-base-offset tiling windows of 100 contiguous nucleotide bases. AnnoGen is capable of annotating these pragmatic features for variable genomic regions of interest to the user and optionally comparing two parallel sets of genomic regions. AnnoGen is characterized by simple utility modes and a succinct HTML report of informative statistical tables and plots. Availability and implementation: https://github.com/shengqh/annogen.
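
The window scheme mentioned in the abstract (windows of 100 bases, offset by a single base) can be illustrated with a per-window Shannon entropy of base composition. This is only a minimal sketch under stated assumptions: it is not AnnoGen's code, the function names `shannon_entropy` and `windowed_entropy` are hypothetical, and whether AnnoGen computes its Entropy feature from base composition in this way is an assumption, not something the abstract states.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per base) of the base composition of seq.

    Illustrative only; assumes a non-empty sequence.
    """
    counts = Counter(seq.upper())
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def windowed_entropy(seq: str, window: int = 100) -> List[float]:
    """Entropy of every window of `window` bases, sliding one base at a time.

    Returns an empty list when the sequence is shorter than one window.
    """
    if len(seq) < window:
        return []
    return [
        shannon_entropy(seq[i:i + window])
        for i in range(len(seq) - window + 1)
    ]


if __name__ == "__main__":
    # A 120-base toy sequence yields 120 - 100 + 1 = 21 windows.
    toy = "ACGT" * 30
    values = windowed_entropy(toy, window=100)
    print(len(values), values[0])
```

A sequence of length n yields n - 100 + 1 windows under this scheme, so the output is close to base-wise in size, which may be what the abstract means by "enormous base-wise quantities".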


2019 ◽  
Author(s):  
Ran Li ◽  
Xiaomeng Tian ◽  
Peng Yang ◽  
Yingzhi Fan ◽  
Ming Li ◽  
...  

Non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structural variation with NRS exist. Here, we compared 31 human de novo assemblies with the current reference genome to identify the NRS and their locations. We resolved the precise location of 6,113 NRS adding up to 12.8 Mb. Besides 1,571 insertions, we detected 3,041 alternate alleles, which were defined as having less than 90% (or no) identity with the reference alleles. These alternate alleles overlapped with 1,143 protein-coding genes, including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had a high content of tandem repeats, indicating that their origin was associated with tandem repeats. Our study enriches the spectrum of human genetic variation.


2021 ◽  
Author(s):  
Jaclyn M Noshay ◽  
Zhikai Liang ◽  
Peng Zhou ◽  
Peter A Crisp ◽  
Alexandre P Marand ◽  
...  

Accessible chromatin and unmethylated DNA are associated with many genes and cis-regulatory elements. Attempts to understand natural variation for accessible chromatin regions (ACRs) and unmethylated regions (UMRs) often rely upon alignments to a single reference genome. This limits the ability to assess regions that are absent in the reference genome assembly and monitor how nearby structural variants influence variation in chromatin state. In this study, de novo genome assemblies for four maize inbreds (B73, Mo17, Oh43 and W22) are utilized to assess chromatin accessibility and DNA methylation patterns in a pan-genome context. The number of UMRs and ACRs that can be identified is more accurate when chromatin data is aligned to the matched genome rather than a single reference genome. While there are UMRs and ACRs present within genomic regions that are not shared between genotypes, these features are substantially enriched within shared regions, as determined by chromosomal alignments. Characterization of UMRs present within shared genomic regions reveals that most UMRs maintain the unmethylated state in other genotypes with only a small number being polymorphic between genotypes. However, the majority of UMRs between genotypes only exhibit partial overlaps suggesting that the boundaries between methylated and unmethylated DNA are dynamic. This instability is not solely due to sequence variation as these partially overlapping UMRs are frequently found within genomic regions that lack sequence variation. The ability to compare chromatin properties among individuals with structural variation enables pan-epigenome analyses to study the sources of variation for accessible chromatin and unmethylated DNA.
Article summary: Regions of the genome that have accessible chromatin or unmethylated DNA are often associated with cis-regulatory elements. We assessed chromatin accessibility and DNA methylation in four structurally diverse maize genomes. There are accessible or unmethylated regions within the non-shared portions of the genomes but these features are depleted within these regions. Evaluating the dynamics of methylation and accessibility between genotypes reveals conservation of features, albeit with variable boundaries suggesting some instability of the precise edges of unmethylated regions.


2019 ◽  
Author(s):  
Shuo Zhang ◽  
Erin S. Kelleher

The regulation of transposable element (TE) activity by small RNAs is a ubiquitous feature of germlines. However, despite the obvious benefits to the host in terms of ensuring the production of viable gametes and maintaining the integrity of the genomes they carry, it remains controversial whether TE regulation evolves adaptively. We examined the emergence and evolutionary dynamics of repressor alleles after P-elements invaded the Drosophila melanogaster genome in the mid 20th century. In many animals including Drosophila, repressor alleles are produced by transpositional insertions into piRNA clusters, genomic regions encoding the Piwi-interacting RNAs (piRNAs) that regulate TEs. We discovered that ∼94% of recently collected isofemale lines in the Drosophila Genetic Reference Panel (DGRP) contain at least one P-element insertion in a piRNA cluster, indicating that repressor alleles are produced by de novo insertion at an exceptional rate. Furthermore, in our sample of ∼200 genomes, we uncovered no fewer than 80 unique P-element insertion alleles in at least 15 different piRNA clusters. Finally, we observe no footprint of positive selection on P-element insertions in piRNA clusters, suggesting that the rapid evolution of piRNA-mediated repression in D. melanogaster was driven primarily by mutation. Our results reveal for the first time how the unique genetic architecture of piRNA production, in which numerous piRNA clusters can encode regulatory small RNAs upon transpositional insertion, facilitates the non-adaptive rapid evolution of repression.


2019 ◽  
Vol 48 (1) ◽  
pp. 290-303 ◽  
Author(s):  
Christopher E Ellison ◽  
Weihuan Cao

Illumina sequencing has allowed for population-level surveys of transposable element (TE) polymorphism via split alignment approaches, which have provided important insight into the population dynamics of TEs. However, such approaches are not able to identify insertions of uncharacterized TEs, nor can they assemble the full sequence of inserted elements. Here, we use nanopore sequencing and Hi-C scaffolding to produce de novo genome assemblies for two wild strains of Drosophila melanogaster from the Drosophila Genetic Reference Panel (DGRP). Ovarian piRNA populations and Illumina split-read TE insertion profiles have been previously produced for both strains. We find that nanopore sequencing with Hi-C scaffolding produces highly contiguous, chromosome-length scaffolds, and we identify hundreds of TE insertions that were missed by Illumina-based methods, including a novel micropia-like element that has recently invaded the DGRP population. We also find hundreds of piRNA-producing loci that are specific to each strain. Some of these loci are created by strain-specific TE insertions, while others appear to be epigenetically controlled. Our results suggest that Illumina approaches reveal only a portion of the repetitive sequence landscape of eukaryotic genomes and that population-level resequencing using long reads is likely to provide novel insight into the evolutionary dynamics of repetitive elements.

