reference human genome
Recently Published Documents



Author(s):  
Karen H. Miga ◽  
Ting Wang

The reference human genome sequence is inarguably the most important and widely used resource in the fields of human genetics and genomics. It has transformed the conduct of biomedical sciences and brought invaluable benefits to the understanding and improvement of human health. However, the commonly used reference sequence has profound limitations, because across much of its span, it represents the sequence of just one human haplotype. This single, monoploid reference structure presents a critical barrier to representing the broad genomic diversity in the human population. In this review, we discuss the modernization of the reference human genome sequence to a more complete reference of human genomic diversity, known as a human pangenome. Expected final online publication date for the Annual Review of Genomics and Human Genetics, Volume 22 is August 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.


Author(s):  
James M. Holt ◽  
Melissa Kelly ◽  
Brett Sundlof ◽  
Ghunwa Nakouzi ◽  
David Bick ◽  
...  

Abstract Purpose Clinical genome sequencing (cGS) followed by orthogonal confirmatory testing is standard practice. While orthogonal testing significantly improves specificity, it also results in increased turnaround time and cost of testing. The purpose of this study is to evaluate machine learning models trained to identify false positive variants in cGS data to reduce the need for orthogonal testing. Methods We sequenced five reference human genome samples characterized by the Genome in a Bottle Consortium (GIAB) and compared the results with an established set of variants for each genome referred to as a truth set. We then trained machine learning models to identify variants that were labeled as false positives. Results After training, the models identified 99.5% of the false positive heterozygous single-nucleotide variants (SNVs) and heterozygous insertions/deletions variants (indels) while reducing confirmatory testing of nonactionable, nonprimary SNVs by 85% and indels by 75%. Employing the algorithm in clinical practice reduced overall orthogonal testing using dideoxynucleotide (Sanger) sequencing by 71%. Conclusion Our results indicate that a low false positive call rate can be maintained while significantly reducing the need for confirmatory testing. The framework that generated our models and results is publicly available at https://github.com/HudsonAlpha/STEVE.


2020 ◽  
Vol 21 (1) ◽  
pp. 55-79 ◽  
Author(s):  
Daniel R. Zerbino ◽  
Adam Frankish ◽  
Paul Flicek

Our understanding of the human genome has continuously expanded since its draft publication in 2001. Over the years, novel assays have allowed us to progressively overlay layers of knowledge above the raw sequence of A's, T's, G's, and C's. The reference human genome sequence is now a complex knowledge base maintained under the shared stewardship of multiple specialist communities. Its complexity stems from the fact that it is simultaneously a template for transcription, a record of evolution, a vehicle for genetics, and a functional molecule. In short, the human genome serves as a frame of reference at the intersection of a diversity of scientific fields. In recent years, the progressive fall in sequencing costs has given increasing importance to the quality of the human reference genome, as hundreds of thousands of individuals are being sequenced yearly, often for clinical applications. Also, novel sequencing-based assays shed light on novel functions of the genome, especially with respect to gene expression regulation. Keeping the human genome annotation up to date and accurate is therefore an ongoing partnership between reference annotation projects and the greater community worldwide.


2020 ◽  
Author(s):  
James M. Holt ◽  
Melissa Wilk ◽  
Brett Sundlof ◽  
Ghunwa Nakouzi ◽  
David Bick ◽  
...  

Abstract Purpose Clinical genome sequencing (cGS) followed by orthogonal confirmatory testing is standard practice. While orthogonal testing significantly improves specificity it also results in increased turn-around-time and cost of testing. The purpose of this study is to evaluate machine learning models trained to identify false positive variants in cGS data to reduce the need for orthogonal testing. Methods We sequenced five reference human genome samples characterized by the Genome in a Bottle Consortium (GIAB) and compared the results to an established set of variants for each genome referred to as a ‘truth-set’. We then trained machine learning models to identify variants that were labeled as false positives. Results After training, the models identified 99.5% of the false positive heterozygous single nucleotide variants (SNVs) and heterozygous insertions/deletions variants (indels) while reducing confirmatory testing of true positive SNVs to 1.67% and indels to 20.29%. Employing the algorithm in clinical practice reduced orthogonal testing using dideoxynucleotide (Sanger) sequencing by 78.22%. Conclusion Our results indicate that a low false positive call rate can be maintained while significantly reducing the need for confirmatory testing. The framework that generated our models and results is publicly available at https://github.com/HudsonAlpha/STEVE.
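As a rough illustration of how the two kinds of figures in the Results relate, the Python sketch below computes, for a chosen score threshold, the fraction of false positive calls that are flagged and the fraction of true positive calls that would still be sent for confirmation. The function name, the score convention (a higher score means "more likely a false positive"), and the toy data are assumptions made for illustration only; this is not the STEVE implementation.

```python
from typing import Sequence, Tuple


def confirmation_rates(
    scores: Sequence[float],
    is_false_positive: Sequence[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Return (fraction of false positives flagged, fraction of true positives flagged).

    A call is "flagged" (sent for confirmatory testing) when its score is at
    or above the threshold. The score convention is an assumption of this sketch.
    """
    flagged_fp = flagged_tp = total_fp = total_tp = 0
    for score, is_fp in zip(scores, is_false_positive):
        flagged = int(score >= threshold)
        if is_fp:
            total_fp += 1
            flagged_fp += flagged
        else:
            total_tp += 1
            flagged_tp += flagged
    if total_fp == 0 or total_tp == 0:
        raise ValueError("need at least one false positive and one true positive")
    return flagged_fp / total_fp, flagged_tp / total_tp


# Toy data (hypothetical): two false positive calls and four true positive calls.
toy_scores = [0.9, 0.8, 0.2, 0.1, 0.6, 0.05]
toy_labels = [True, True, False, False, False, False]
fp_captured, tp_still_tested = confirmation_rates(toy_scores, toy_labels, 0.5)
print(fp_captured, tp_still_tested)  # 1.0 0.25
```

Lowering the threshold captures more false positives at the cost of sending more true positives for confirmation, which is the trade-off the abstract's percentages summarize.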


Author(s):  
Mosè Manni ◽  
Evgeny Zdobnov

Abstract Human pan-genome studies offer the opportunity to identify human non-reference sequences (NRSs) which are, by definition, not represented in the reference human genome (GRCh38). NRSs serve as useful catalogues of genetic variation for population and disease studies and while the majority consists of repetitive elements, a substantial fraction is made of non-repetitive, non-reference (NRNR) sequences. The presence of non-human sequences in these catalogues can inflate the number of “novel” human sequences, overestimate the genetic differentiation among populations, and jeopardize subsequent analyses that rely on these resources. We uncovered almost 2,000 contaminant sequences of microbial origin in NRNR sequences from recent human pan-genome studies. The contaminant contigs (3,501,302 bp) harbour genes totalling 4,720 predicted proteins (>40 aa). The major sources of contamination are related to Rhizobiales, Burkholderiales, Pseudomonadales and Lactobacillales, which may have been associated with the original samples or introduced later during sequencing experiments. We additionally observed that the majority of human novel protein-coding genes described in one of the studies entirely overlap repetitive regions and are likely to be false positive predictions. We report here the list of contaminant sequences in three recent human pan-genome catalogues and discuss strategies to increase decontamination efficacy for current and future pan-genome studies.


2020 ◽  
Author(s):  
Lauren Coombe ◽  
Vladimir Nikolić ◽  
Justin Chu ◽  
Inanc Birol ◽  
René L. Warren

Abstract Summary The ability to generate high-quality genome sequences is cornerstone to modern biological research. Even with recent advancements in sequencing technologies, many genome assemblies are still not achieving reference-grade. Here, we introduce ntJoin, a tool that leverages structural synteny between a draft assembly and reference sequence(s) to contiguate and correct the former with respect to the latter. Instead of alignments, ntJoin uses a lightweight mapping approach based on a graph data structure generated from ordered minimizer sketches. The tool can be used in a variety of different applications, including improving a draft assembly with a reference-grade genome, a short read assembly with a draft long read assembly, and a draft assembly with an assembly from a closely-related species. When scaffolding a human short read assembly using the reference human genome or a long read assembly, ntJoin improves the NGA50 length 23- and 13-fold, respectively, in under 13 m, using less than 11 GB of RAM. Compared to existing reference-guided assemblers, ntJoin generates highly contiguous assemblies faster and using less memory. Availability and implementation ntJoin is written in C++ and Python, and is freely available at https://github.com/bcgsc/[email protected]
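To make the phrase "ordered minimizer sketches" more concrete, here is a minimal Python sketch of one common way to compute such a sketch: slide a window over consecutive k-mers and keep the smallest k-mer of each window, in order of position. The function name, the parameters `k` and `w`, and the use of plain lexicographic ordering are assumptions for illustration (practical tools typically order k-mers by hash value); this is not ntJoin's code.

```python
from typing import List, Tuple


def ordered_minimizers(seq: str, k: int, w: int) -> List[Tuple[int, str]]:
    """Return (position, k-mer) minimizers in order of appearance.

    For every window of w consecutive k-mers, the lexicographically smallest
    k-mer is selected (leftmost on ties). A selection identical to the
    previous one is not repeated.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch: List[Tuple[int, str]] = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first minimal index, so ties go to the leftmost k-mer.
        offset = min(range(w), key=lambda j: window[j])
        pick = (start + offset, window[offset])
        if not sketch or sketch[-1] != pick:
            sketch.append(pick)
    return sketch


print(ordered_minimizers("ACGTACGT", k=3, w=2))
# [(0, 'ACG'), (1, 'CGT'), (2, 'GTA'), (4, 'ACG')]
```

Because the sketch keeps only a subset of k-mers, two sequences can be compared through their shared minimizers rather than through full alignments, which is the general idea behind a lightweight mapping approach.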


2019 ◽  
Author(s):  
Andrey V. Bzikadze ◽  
Pavel A. Pevzner

Abstract Although variations in centromeres have been linked to cancer and infertility, centromeres still represent the “dark matter of the human genome” and remain an enigma for both biomedical and evolutionary studies. Since centromeres have withstood all previous attempts to develop an automated tool for their assembly and since their assembly using short reads is viewed as intractable, recent efforts attempted to manually assemble centromeres using long error-prone reads. We describe the centroFlye algorithm for centromere assembly using long error-prone reads, apply it for assembling the human X centromere, and use the constructed assembly to gain insights into centromere evolution. Our analysis reveals putative breakpoints in the previous manual reconstruction of the human X centromere and opens a possibility to automatically close the remaining multi-megabase gaps in the reference human genome.


2019 ◽  
Author(s):  
Ilias Georgakopoulos-Soares ◽  
Gene Koh ◽  
Josef Jiricny ◽  
Martin Hemberg ◽  
Serena Nik-Zainal

Introductory paragraph The mechanisms that underpin how insertions or deletions (indels) become fixed in DNA have primarily been ascribed to replication-related and/or double-strand break (DSB)-related processes. We introduce a novel way to evaluate indels, orientating them relative to gene transcription. In so doing, we reveal a number of surprising findings: First, there is a transcriptional strand asymmetry in the distribution of mononucleotide repeat tracts in the reference human genome. Second, there is a strong transcriptional strand asymmetry of indels across 2,575 whole genome sequenced human cancers. We suggest that this is due to the activity of transcription-coupled nucleotide excision repair (TC-NER). Furthermore, TC-NER interacts with mismatch repair (MMR) under physiological conditions to produce strand bias. Finally, we show how insertions and deletions differ in their dependencies on these repair pathways. Our novel analytical approach reveals new insights into the contribution of DNA repair towards indel mutagenesis in human cells.
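One simple way to picture a strand asymmetry in mononucleotide repeat tracts is to count tracts of each base along a sequence given in the orientation of one strand, then compare a base with its complement. The Python sketch below does this for a toy sequence; the function names, the minimum tract length, and the asymmetry formula are assumptions chosen for illustration, not the authors' method.

```python
import re
from collections import Counter


def count_mononucleotide_tracts(seq: str, min_len: int = 5) -> Counter:
    """Count runs of a single repeated base that are at least min_len long."""
    counts: Counter = Counter()
    for match in re.finditer(r"A+|C+|G+|T+", seq.upper()):
        run = match.group()
        if len(run) >= min_len:
            counts[run[0]] += 1
    return counts


def strand_asymmetry(counts: Counter, base: str, complement: str) -> float:
    """Return (n_base - n_complement) / (n_base + n_complement), or 0.0 if both are 0."""
    n_base, n_comp = counts[base], counts[complement]
    total = n_base + n_comp
    return (n_base - n_comp) / total if total else 0.0


# Toy sequence (hypothetical), read 5'->3' on a single strand.
toy = "AAAAA" "GC" "TTTTTT" "G" "AAAAAA" "CCC"
tracts = count_mononucleotide_tracts(toy)
print(tracts["A"], tracts["T"])  # 2 1
```

A value of `strand_asymmetry(tracts, "A", "T")` far from zero across many genes oriented by transcription direction would indicate the kind of bias the paragraph describes; a value near zero would indicate none.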


GigaScience ◽  
2018 ◽  
Vol 7 (12) ◽  
Author(s):  
Jie Huang ◽  
Xinming Liang ◽  
Yuankai Xuan ◽  
Chunyu Geng ◽  
Yuxiang Li ◽  
...  

Blood ◽  
2018 ◽  
Vol 132 (Supplement 1) ◽  
pp. 2332-2332
Author(s):  
Yan Zheng ◽  
Ti-Cheng Chang ◽  
Gang Wu ◽  
Jane S. Hankins ◽  
Mitchell J. Weiss ◽  
...  

Abstract Introduction RBC alloimmunization is common in patients with sickle cell disease (SCD). Despite serological matching of RBCs for major Rh antigens, Rh alloimmunization remains problematic. The Rh blood group is encoded by two genes, RHD and RHCE, which exhibit extensive nucleotide polymorphism and chromosome structural changes, resulting in the formation of Rh variant antigens. Rh variants can result in loss of protein epitopes or expression of neo-epitopes, and are common in SCD patients. Hence SCD patients harboring Rh variants can be predisposed to Rh alloimmunization. Given the limitation of traditional serologic antigen typing for detection of Rh variants, molecular genotyping has become required. A DNA microarray-based platform, BioArray RHCE and RHD BeadChip (Immucor), is available for RH genotyping. However, it detects the most common, but not all, variants. Whole exome sequence data have been used for prediction of Rh variants (Chou et al., Blood Adv., 2017) and offer some advantages, including detection of rare variants, structural rearrangements and copy number variation. However, whole genome sequence (WGS) analysis of RHD/RHCE is challenging due to difficulties in mapping next generation sequencing (NGS) reads to this duplicated gene family. We developed a computational algorithm to identify RH variants using WGS data. Methods The pipeline included three major components: RH allele database construction, RH variant calling, and classification of Rh blood group according to the identified variants. The RH allele database was built based on the NCBI Blood Group Antigen Gene Mutation (BGMUT) and International Society of Blood Transfusion (ISBT) databases. Since the alleles in the BGMUT and ISBT databases were specified according to conventional RH genes (RHD, L08429; RHCE, DQ322275) that are different from those on the reference human genome, we first called the variations based on the reference human genome.
The positions of the identified variations were subsequently corrected to match the BGMUT and ISBT annotation system. Next, the NGS reads with low base quality and/or mapping quality were discarded during the variation calling step. Synonymous and non-synonymous amino acid changes were characterized for each polymorphism. Haplotypes were constructed for the segments with NGS read support. Gene sequencing coverage was calculated to determine gene deletions or amplifications. Lastly, we implemented an algorithm to predict RH genotypes based on a selection of candidate alleles by a read-mapping profile, which considers both sequence variations and sequence consistency, followed by a likelihood-based ranking of all pairwise combinations of the selected alleles. The allele combination with the highest likelihood is considered the most likely pair of alleles at a given locus. Patient specimens used in this study were from participants of the Sickle Cell Clinical Research and Intervention Program (SCCRIP; Hankins et al., Pediatr Blood Cancer, 2018). Results We validated our method in a cohort of 58 SCD patients whose RH genotypes had been determined by BioArray RHCE and RHD BeadChip and supplementary molecular tests that identify the most common variants among individuals of African descent. In this validation cohort including a total of 11 RHD and 13 RHCE alleles, our approach achieved a concordance rate of 85.85% (91 of 106 alleles) for RHD and 83.02% (88 of 106 alleles) for RHCE genotyping. WGS was highly sensitive in distinguishing homozygosity from heterozygosity of genes. By comparing the numbers of NGS reads on RH regions with the whole genome average coverage, heterozygous deletions can be determined. Since WGS provides comprehensive genotyping, our analysis identified single nucleotide polymorphisms that were not identified by the BeadChip and supplemental molecular testing.
The final source of discordance was likely due to the short read length of NGS, such that haplotype phases cannot be correctly predicted if the variations are separated by thousands of base pairs, for which long read DNA sequencing or RNA/cDNA sequencing is required. Evaluation of the identified discrepancies is ongoing. Conclusions We developed and validated a diagnostic method for RH genotyping that leveraged the accuracy and flexibility of RH genotyping based on WGS data. With further optimization of our method, this may be useful for RBC genotype matching of sickle cell patients to blood donors in the future. Disclosures Hankins: Novartis: Research Funding; Global Blood Therapeutics: Research Funding; NCQA: Consultancy; bluebird bio: Consultancy.
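The coverage comparison mentioned in the Results of the abstract above (read depth over a gene region versus whole genome average depth) can be sketched in a few lines of Python. The function name, the diploid default, and the example depths are assumptions for illustration; the abstract does not describe the authors' actual calculation at this level of detail.

```python
def estimate_copy_number(region_depth: float, genome_depth: float, ploidy: int = 2) -> int:
    """Estimate gene copy number from the ratio of region depth to genome-wide depth.

    A region covered at roughly half the genome average suggests one copy
    (a heterozygous deletion) in a diploid sample; roughly zero suggests a
    homozygous deletion.
    """
    if genome_depth <= 0:
        raise ValueError("genome_depth must be positive")
    return round(ploidy * region_depth / genome_depth)


# Hypothetical depths: the region is covered at about half the 30x genome average.
print(estimate_copy_number(region_depth=15.2, genome_depth=30.0))  # 1
```

In practice such an estimate is sensitive to mapping ambiguity in duplicated gene families like RHD/RHCE, so a rounded ratio of this kind is only a first approximation.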

