Towards a reference genome that captures global genetic diversity

2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Karen H. Y. Wong ◽  
Walfred Ma ◽  
Chun-Yu Wei ◽  
Erh-Chan Yeh ◽  
Wan-Jia Lin ◽  
...  

Abstract The current human reference genome is predominantly derived from a single individual and it does not adequately reflect human genetic diversity. Here, we analyze 338 high-quality human assemblies of genetically divergent human populations to identify missing sequences in the human reference genome with breakpoint resolution. We identify 127,727 recurrent non-reference unique insertions spanning 18,048,877 bp, some of which disrupt exons and known regulatory elements. To improve genome annotations, we linearly integrate these sequences into the chromosomal assemblies and construct a Human Diversity Reference. Leveraging this reference, an average of 402,573 previously unmapped reads can be recovered for a given genome sequenced to ~40X coverage. Transcriptomic diversity among these non-reference sequences can also be directly assessed. We successfully map tens of thousands of previously discarded RNA-Seq reads to this reference and identify transcription evidence in 4781 gene loci, underlining the importance of these non-reference sequences in functional genomics. Our extensive datasets are important advances toward a comprehensive reference representation of global human genetic diversity.

BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Ran Li ◽  
Xiaomeng Tian ◽  
Peng Yang ◽  
Yingzhi Fan ◽  
Ming Li ◽  
...  

Abstract Background Non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structural variation involving NRS exist. Results Here, we compared 31 human de novo assemblies with the current reference genome to identify NRS and their locations. We resolved the precise location of 6113 NRS adding up to 12.8 Mb. Besides 1571 insertions, we detected 3041 alternate alleles, which were defined as having less than 90% (or no) identity with the reference alleles. These alternate alleles overlapped with 1143 protein-coding genes, including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had a high content of tandem repeats, indicating that their origin was associated with tandem repeats. Conclusions Our study detected a large number of NRS, including many previously uncharacterized alternate alleles. We suggest that the origin of alternate alleles is associated with tandem repeats. Our results enrich the spectrum of known genetic variation in the human genome.
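The 90% identity threshold above can be illustrated with a minimal sketch. The abstract does not say how identity was computed, so this assumes two pre-aligned, equal-length allele strings with "-" marking gaps; the function names and the "reference-like" label are hypothetical and are not taken from the study.

```python
def percent_identity(ref_aln: str, nrs_aln: str) -> float:
    """Percent of alignment columns where both alleles carry the same base.

    Assumption: the inputs are already aligned to equal length, with '-' as the gap symbol.
    """
    if len(ref_aln) != len(nrs_aln):
        raise ValueError("aligned alleles must have equal length")
    if not ref_aln:
        return 0.0
    matches = sum(r == n and r != "-" for r, n in zip(ref_aln, nrs_aln))
    return 100.0 * matches / len(ref_aln)


def classify_nrs(ref_aln: str, nrs_aln: str, threshold: float = 90.0) -> str:
    """Label a non-reference sequence using the abstract's 90% identity cutoff."""
    # No reference bases at the locus: treat the NRS as a pure insertion (assumption).
    if ref_aln.replace("-", "") == "":
        return "insertion"
    # A reference allele exists but shares less than `threshold` percent identity.
    if percent_identity(ref_aln, nrs_aln) < threshold:
        return "alternate allele"
    return "reference-like"  # hypothetical label for everything else


print(classify_nrs("----", "ACGT"))
print(classify_nrs("ACGTACGTAC", "TTTTACGTAC"))
```

A real pipeline would derive these alignments from assembly-to-reference comparisons; the toy strings here only show where the cutoff applies.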


BMC Genomics ◽  
2011 ◽  
Vol 12 (1) ◽  
Author(s):  
Geng Chen ◽  
Ruiyuan Li ◽  
Leming Shi ◽  
Junyi Qi ◽  
Pengzhan Hu ◽  
...  

Genes ◽  
2020 ◽  
Vol 11 (11) ◽  
pp. 1350
Author(s):  
Jina Kim ◽  
Joohon Sung ◽  
Kyudong Han ◽  
Wooseok Lee ◽  
Seyoung Mun ◽  
...  

The current human reference genome (GRCh38), with its superior quality, has contributed significantly to genome analysis. However, GRCh38 may still underrepresent ethnic genomes, particularly those of Asians, though exactly what is missing remains elusive. Here, we juxtaposed GRCh38 with a high-contiguity genome assembly of one Korean individual (AK1) to show that part of the AK1 genome is missing in GRCh38 and that the missing regions harbored ~1390 putative coding elements. Furthermore, when we analyzed the “unmapped” (to GRCh38) reads of fourteen individuals (five East Asians, four Europeans, and five Africans), we found that multiple populations shared certain parts of the missing genome, amounting to ~5.3 Mb (~0.2% of AK1) of the total genomic regions. The AK1 regions recovered from the “unmapped reads”, which were the estimated missing regions that did not exist in GRCh38, harbored candidate coding elements. We verified that most of the common (shared by ≥7 individuals) missing regions exist in human and chimpanzee DNA. Moreover, we characterized the occurrence mechanism, ethnic heterogeneity, and presence of the common missing regions. This study illuminates a potential advantage of using a pangenome reference and highlights the need for further investigation of the various features of regions globally missed in GRCh38.


2019 ◽  
Author(s):  
Ran Li ◽  
Xiaomeng Tian ◽  
Peng Yang ◽  
Yingzhi Fan ◽  
Ming Li ◽  
...  

Abstract Non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structural variation involving NRS exist. Here, we compared 31 human de novo assemblies with the current reference genome to identify NRS and their locations. We resolved the precise location of 6,113 NRS adding up to 12.8 Mb. Besides 1,571 insertions, we detected 3,041 alternate alleles, which were defined as having less than 90% (or no) identity with the reference alleles. These alternate alleles overlapped with 1,143 protein-coding genes, including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had a high content of tandem repeats, indicating that their origin was associated with tandem repeats. Our study enriches the spectrum of known human genetic variation.


2017 ◽  
Author(s):  
Nadia M Davidson ◽  
Alicia Oshlack

Abstract Background RNA-Seq analyses can benefit from performing a genome-guided and de novo assembly, in particular for species where the reference genome or the annotation is incomplete. However, tools for integrating assembled transcriptome with reference annotation are lacking. Findings Necklace is a software pipeline that runs genome-guided and de novo assembly and combines the resulting transcriptomes with reference genome annotations. Necklace constructs a compact but comprehensive superTranscriptome out of the assembled and reference data. Reads are subsequently aligned and counted in preparation for differential expression testing. Conclusions Necklace allows a comprehensive transcriptome to be built from a combination of assembled and annotated transcripts which results in a more comprehensive transcriptome for the majority of organisms. In addition, RNA-seq data is mapped back to this newly created superTranscript reference to enable differential expression testing with standard methods. Necklace is available from https://github.com/Oshlack/necklace/wiki under GPL 3.0.


2021 ◽  
Author(s):  
Danang Crysnanto ◽  
Alexander S. Leonard ◽  
Zih-Hua Fang ◽  
Hubert Pausch

Linear reference genomes are typically assembled from single individuals. They are unable to reflect the genetic diversity of populations and lack millions of bases. To overcome such limitations and make non-reference sequences amenable to genetic investigations, we build a multi-assembly graph from six reference-quality assemblies from taurine cattle and their close relatives. We uncover 70,329,827 bases that are missing in the bovine linear reference genome. The missing sequences encode novel transcripts that are differentially expressed between individual animals. Reads that previously mapped poorly or not at all to the bovine reference genome now align accurately to the non-reference sequences. We show that the non-reference sequences contain polymorphic sites that segregate within and between breeds of cattle. Our efforts to uncover novel functional sequences from a multi-assembly graph pave the way towards the transition to a more representative bovine reference genome.


2020 ◽  
Author(s):  
Benjamin Kaminow ◽  
Sara Ballouz ◽  
Jesse Gillis ◽  
Alexander Dobin

The Human Reference Genome serves as the foundation for modern genomic analyses. However, in its present form, it does not adequately represent the vast genetic diversity of the human population. In this study, we explored the consensus genome as a potential successor of the current Reference genome, and assessed its effect on the accuracy of RNA-seq read alignment. In order to find the best haploid genome representation, we constructed consensus genomes at the Pan-human, Super-population and Population levels, utilizing variant information from the 1000 Genomes project. Using personal haploid genomes as the ground truth, we compared mapping errors for real RNA-seq reads aligned to the consensus genomes versus the Reference genome. For reads overlapping homozygous variants, we found that the mapping error decreased by a factor of ∼2-3 when the Reference was replaced with the Pan-human consensus genome. Interestingly, we also found that using more population-specific consensuses resulted in little to no increase over using the Pan-human consensus, suggesting a limit in the utility of incorporating more specific genomic variation. To assess the functional impact, we performed transcript expression quantification and found that the Pan-human consensus increases accuracy of transcript quantification for hundreds of transcripts.
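The idea of a consensus genome can be sketched in a few lines. The abstract does not give the construction details, so the following assumes the simplest case: single-nucleotide variants only, with the reference base replaced wherever the alternate allele is the majority allele in the chosen population. The function name and the toy variant records are hypothetical.

```python
def build_consensus(reference: str, variants: list[tuple[int, str, str, float]]) -> str:
    """Return a haploid consensus sequence with the major allele at each SNV site.

    variants: (0-based position, reference base, alternate base, alternate allele frequency).
    Assumption: SNVs only, so the coordinates of the sequence never shift.
    """
    seq = list(reference)
    for pos, ref_base, alt_base, alt_af in variants:
        if seq[pos] != ref_base:
            raise ValueError(f"reference mismatch at position {pos}")
        # Swap in the alternate base only when it is the majority allele.
        if alt_af > 0.5:
            seq[pos] = alt_base
    return "".join(seq)


# Toy data: the first variant is the majority allele, the second is not.
toy_variants = [(1, "C", "T", 0.8), (3, "T", "G", 0.2)]
print(build_consensus("ACGTAC", toy_variants))
```

Computing the allele frequencies over all samples, over a super-population, or over a single population would give consensus sequences at the three levels named in the abstract; indels would additionally require coordinate bookkeeping that this sketch omits.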


BMC Biology ◽  
2020 ◽  
Vol 18 (1) ◽  
Author(s):  
Young-gun Lee ◽  
Jin-young Lee ◽  
Junhyong Kim ◽  
Young-Joon Kim

Abstract Background Structural variants comprise diverse genomic arrangements including deletions, insertions, inversions, and translocations, which can generally be detected in humans through sequence comparison to the reference genome. Among structural variants, insertions are the least frequently identified variants, mainly due to ascertainment bias in the reference genome, lack of previous sequence knowledge, and low complexity of typical insertion sequences. Though recent developments in long-read sequencing deliver promise in annotating individual non-reference insertions, population-level catalogues on non-reference insertion variants have not been identified and the possible functional roles of these hidden variants remain elusive. Results To detect non-reference insertion variants, we developed a pipeline, InserTag, which generates non-reference contigs by local de novo assembly and then infers the full-sequence of insertion variants by tracing contigs from non-human primates and other human genome assemblies. Application of the pipeline to data from 2535 individuals of the 1000 Genomes Project helped identify 1696 non-reference insertion variants and re-classify the variants as retention of ancestral sequences or novel sequence insertions based on the ancestral state. Genotyping of the variants showed that individuals had, on average, 0.92-Mbp sequences missing from the reference genome, 92% of the variants were common (allele frequency > 5%) among human populations, and more than half of the variants were major alleles. Among human populations, African populations were the most divergent and had the most non-reference sequences, which was attributed to the greater prevalence of high-frequency insertion variants. The subsets of insertion variants were in high linkage disequilibrium with phenotype-associated SNPs and showed signals of recent continent-specific selection. 
Conclusions Non-reference insertion variants represent an important type of genetic variation in the human population, and our developed pipeline, InserTag, provides the frameworks for the detection and genotyping of non-reference sequences missing from human populations.
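The frequency summaries reported above (share of common variants, share of major alleles) can be computed as in the following minimal sketch. It assumes biallelic variants, so that an insertion allele with frequency above 0.5 is the major allele; the function name and the example frequencies are hypothetical and are not data from the study.

```python
def summarize_allele_frequencies(afs: list[float], common_cutoff: float = 0.05) -> dict[str, float]:
    """Fraction of variants that are common (AF > cutoff) and that are major alleles (AF > 0.5)."""
    if not afs:
        raise ValueError("need at least one allele frequency")
    n = len(afs)
    common = sum(af > common_cutoff for af in afs)
    # Assumption: biallelic sites, so AF > 0.5 means the insertion is the major allele.
    major = sum(af > 0.5 for af in afs)
    return {"common_fraction": common / n, "major_fraction": major / n}


# Made-up frequencies for four insertion variants.
print(summarize_allele_frequencies([0.01, 0.10, 0.60, 0.90]))
```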

