scholarly journals centroFlye: Assembling Centromeres with Long Error-Prone Reads

2019 ◽  
Author(s):  
Andrey V. Bzikadze ◽  
Pavel A. Pevzner

AbstractAlthough variations in centromeres have been linked to cancer and infertility, centromeres still represent the “dark matter of the human genome” and remain an enigma for both biomedical and evolutionary studies. Since centromeres have withstood all previous attempts to develop an automated tool for their assembly and since their assembly using short reads is viewed as intractable, recent efforts attempted to manually assemble centromeres using long error-prone reads. We describe the centroFlye algorithm for centromere assembly using long error-prone reads, apply it for assembling the human X centromere, and use the constructed assembly to gain insights into centromere evolution. Our analysis reveals putative breakpoints in the previous manual reconstruction of the human X centromere and opens a possibility to automatically close the remaining multi-megabase gaps in the reference human genome.




2015 ◽  
Vol 7 (13) ◽  
pp. 1627-1630 ◽  
Author(s):  
Concetta Avitabile ◽  
Luca Domenico D'Andrea ◽  
Alessandra Romanelli
Keyword(s):  


Blood ◽  
2005 ◽  
Vol 106 (11) ◽  
pp. 605-605
Author(s):  
Marco A. Marra ◽  
Martin Krzywinski ◽  
Readman Chiu ◽  
Matthew Field ◽  
Inanc Birol ◽  
...  

Abstract With the aim of identifying and sequencing mutations in follicular lymphoma genomes, we have begun a project to generate at least 24 deeply redundant sequence-ready Bacterial Artificial Clone (BAC) - based whole genome maps, each from a different individual’s lymphoma. BAC-array CGH and Affymetrix whole-genome sampling assays (WGSA) will be used along with the mapping data to identify genomic amplifications and losses in the lymphomas. Results from the mapping and array studies will be used to prioritize BAC clones for sequence analysis. Because each map will span essentially the entire genome of the corresponding lymphoma, we anticipate that essentially all regions of each tumor genome will be represented in easily sequenced BAC clones. This approach facilitates targeted sequencing of genomic regions of interest, including those containing genes relevant to cancer or harboring amplifications or deletions. Our mapping strategy hinges on the successful creation of deeply redundant high quality BAC libraries from primary lymphomas and large scale high throughput restriction enzyme fingerprinting of individual BACs with a version of the technology we used to map the human, mouse, rat and other genomes. The effort is large-scale, and will result in the generation of at least 2.5 million fingerprinted BAC clones over the next three years. Using the fingerprints, we will align the BACs to the reference human genome to assess genome coverage and to identify candidate genome rearrangements. In parallel, we will assemble the fingerprints into genome maps, looking for larger-scale genome variations between the lymphoma maps and the reference genome sequence. To test the feasibility of our approach, we obtained two restriction digest fingerprints from each of 140,000 individual BAC clones. BACs were sampled from a 7-fold redundant BAC library that had been created from genomic DNA purified from a primary follicular lymphoma sample. The fingerprints are being assembled into a clone map with the intent of reconstructing the entire tumor genome. 90,377 fingerprinted clones with unambiguous single alignments to the reference sequence were automatically assembled into 15,538 contigs. Subsequent rounds of semi-automatic contig merging further reduced the number of contigs to 5,433. Only 1,241 clones remained unassembled. We anchored the tumor genome map to the reference human genome sequence by aligning the clone fingerprints to the restriction map computed from the reference sequence assembly. As a result of this, we identified a BAC that captured the canonical t(14;18) translocation characteristic of follicular lymphomas. We sequenced this BAC and confirmed that it contains the expected translocation. Almost 2.6 gigabases (~91%) of the reference genome are represented in the evolving map, with an additional 50,000 clone fingerprints awaiting incorporation into the map assembly. Among these are repeat-rich and other clones that may well harbor genome rearrangements. Additional prioritization of sequencing targets will be undertaken when map construction and analysis of genome copy number alterations are complete.



2012 ◽  
Vol 22 (9) ◽  
pp. 1760-1774 ◽  
Author(s):  
J. Harrow ◽  
A. Frankish ◽  
J. M. Gonzalez ◽  
E. Tapanari ◽  
M. Diekhans ◽  
...  


Author(s):  
Karen H. Miga ◽  
Ting Wang

The reference human genome sequence is inarguably the most important and widely used resource in the fields of human genetics and genomics. It has transformed the conduct of biomedical sciences and brought invaluable benefits to the understanding and improvement of human health. However, the commonly used reference sequence has profound limitations, because across much of its span, it represents the sequence of just one human haplotype. This single, monoploid reference structure presents a critical barrier to representing the broad genomic diversity in the human population. In this review, we discuss the modernization of the reference human genome sequence to a more complete reference of human genomic diversity, known as a human pangenome. Expected final online publication date for the Annual Review of Genomics and Human Genetics, Volume 22 is August 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.



2017 ◽  
Author(s):  
Patrick Marks ◽  
Sarah Garcia ◽  
Alvaro Martinez Barrio ◽  
Kamila Belhocine ◽  
Jorge Bernate ◽  
...  

AbstractLarge-scale population based analyses coupled with advances in technology have demonstrated that the human genome is more diverse than originally thought. To date, this diversity has largely been uncovered using short read whole genome sequencing. However, standard short-read approaches, used primarily due to accuracy, throughput and costs, fail to give a complete picture of a genome. They struggle to identify large, balanced structural events, cannot access repetitive regions of the genome and fail to resolve the human genome into its two haplotypes. Here we describe an approach that retains long range information while harnessing the advantages of short reads. Starting from only ∼1ng of DNA, we produce barcoded short read libraries. The use of novel informatic approaches allows for the barcoded short reads to be associated with the long molecules of origin producing a novel datatype known as ‘Linked-Reads’. This approach allows for simultaneous detection of small and large variants from a single Linked-Read library. We have previously demonstrated the utility of whole genome Linked-Reads (lrWGS) for performing diploid, de novo assembly of individual genomes (Weisenfeld et al. 2017). In this manuscript, we show the advantages of Linked-Reads over standard short read approaches for reference based analysis. We demonstrate the ability of Linked-Reads to reconstruct megabase scale haplotypes and to recover parts of the genome that are typically inaccessible to short reads, including phenotypically important genes such as STRC, SMN1 and SMN2. We demonstrate the ability of both lrWGS and Linked-Read Whole Exome Sequencing (lrWES) to identify complex structural variations, including balanced events, single exon deletions, and single exon duplications. The data presented here show that Linked-Reads provide a scalable approach for comprehensive genome analysis that is not possible using short reads alone.



2020 ◽  
Vol 21 (1) ◽  
pp. 55-79 ◽  
Author(s):  
Daniel R. Zerbino ◽  
Adam Frankish ◽  
Paul Flicek

Our understanding of the human genome has continuously expanded since its draft publication in 2001. Over the years, novel assays have allowed us to progressively overlay layers of knowledge above the raw sequence of A's, T's, G's, and C's. The reference human genome sequence is now a complex knowledge base maintained under the shared stewardship of multiple specialist communities. Its complexity stems from the fact that it is simultaneously a template for transcription, a record of evolution, a vehicle for genetics, and a functional molecule. In short, the human genome serves as a frame of reference at the intersection of a diversity of scientific fields. In recent years, the progressive fall in sequencing costs has given increasing importance to the quality of the human reference genome, as hundreds of thousands of individuals are being sequenced yearly, often for clinical applications. Also, novel sequencing-based assays shed light on novel functions of the genome, especially with respect to gene expression regulation. Keeping the human genome annotation up to date and accurate is therefore an ongoing partnership between reference annotation projects and the greater community worldwide.



2018 ◽  
Vol 25 (6) ◽  
pp. 529-540
Author(s):  
Yongan Zhao ◽  
Xiaofeng Wang ◽  
Haixu Tang


2019 ◽  
Vol 41 (3) ◽  
pp. 46-48
Author(s):  
Jon M. Laurent ◽  
Sudarshan Pinglay ◽  
Leslie Mitchell ◽  
Ran Brosh

Less than 2% of our genome is protein-coding DNA. The vast expanses of non-coding DNA make up the genome's “dark matter”, where introns, repetitive and regulatory elements reside. Variation between individuals in non-coding regulatory DNA is emerging as a major factor in the genetics of numerous diseases and traits, yet very little is known about how such variations contribute to disease risk. Studying the genetics of regulatory variation is technically challenging as regulatory elements can affect genes located tens of thousands of base pairs away, and often, multiple distal regulatory variations, each with a very small effect, combine in an unknown way to significantly modulate the expression of genes. At the Center for Synthetic Regulatory Genomics (SyRGe) we directly tackle these problems in order to systematically elucidate the mechanisms of regulatory variation underlying human disease.



Sign in / Sign up

Export Citation Format

Share Document