Towards the Human Cancer Genome Project: A Sequence-Ready Physical Map of a Follicular Lymphoma Genome.

Blood ◽  
2005 ◽  
Vol 106 (11) ◽  
pp. 605-605
Author(s):  
Marco A. Marra ◽  
Martin Krzywinski ◽  
Readman Chiu ◽  
Matthew Field ◽  
Inanc Birol ◽  
...  

Abstract With the aim of identifying and sequencing mutations in follicular lymphoma genomes, we have begun a project to generate at least 24 deeply redundant sequence-ready Bacterial Artificial Chromosome (BAC)-based whole genome maps, each from a different individual’s lymphoma. BAC-array CGH and Affymetrix whole-genome sampling assays (WGSA) will be used along with the mapping data to identify genomic amplifications and losses in the lymphomas. Results from the mapping and array studies will be used to prioritize BAC clones for sequence analysis. Because each map will span essentially the entire genome of the corresponding lymphoma, we anticipate that essentially all regions of each tumor genome will be represented in easily sequenced BAC clones. This approach facilitates targeted sequencing of genomic regions of interest, including those containing genes relevant to cancer or harboring amplifications or deletions. Our mapping strategy hinges on the successful creation of deeply redundant high quality BAC libraries from primary lymphomas and large scale high throughput restriction enzyme fingerprinting of individual BACs with a version of the technology we used to map the human, mouse, rat and other genomes. The effort is large-scale, and will result in the generation of at least 2.5 million fingerprinted BAC clones over the next three years. Using the fingerprints, we will align the BACs to the reference human genome to assess genome coverage and to identify candidate genome rearrangements. In parallel, we will assemble the fingerprints into genome maps, looking for larger-scale genome variations between the lymphoma maps and the reference genome sequence. To test the feasibility of our approach, we obtained two restriction digest fingerprints from each of 140,000 individual BAC clones. BACs were sampled from a 7-fold redundant BAC library that had been created from genomic DNA purified from a primary follicular lymphoma sample. 
The fingerprints are being assembled into a clone map with the intent of reconstructing the entire tumor genome. 90,377 fingerprinted clones with unambiguous single alignments to the reference sequence were automatically assembled into 15,538 contigs. Subsequent rounds of semi-automatic contig merging further reduced the number of contigs to 5,433. Only 1,241 clones remained unassembled. We anchored the tumor genome map to the reference human genome sequence by aligning the clone fingerprints to the restriction map computed from the reference sequence assembly. As a result of this, we identified a BAC that captured the canonical t(14;18) translocation characteristic of follicular lymphomas. We sequenced this BAC and confirmed that it contains the expected translocation. Almost 2.6 gigabases (~91%) of the reference genome are represented in the evolving map, with an additional 50,000 clone fingerprints awaiting incorporation into the map assembly. Among these are repeat-rich and other clones that may well harbor genome rearrangements. Additional prioritization of sequencing targets will be undertaken when map construction and analysis of genome copy number alterations are complete.
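The anchoring step described above (aligning clone fingerprints to a restriction map computed from the reference sequence) can be illustrated with a minimal sketch. This is not the authors' fingerprinting software: the function names, the EcoRI-style recognition site, and the 2% sizing tolerance are all assumptions chosen for illustration, since the abstract does not state which enzymes or matching rules were used.

```python
def digest_fragment_sizes(sequence: str, site: str, cut_offset: int) -> list[int]:
    """Return fragment lengths from an in-silico digest of `sequence`.

    `site` is the recognition sequence and `cut_offset` is the position
    within the site where the strand is cut (assumed values, e.g.
    site="GAATTC", cut_offset=1 for an EcoRI-style cut G^AATTC).
    """
    sequence = sequence.upper()
    site = site.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cuts + [len(sequence)]
    # Drop zero-length fragments that would arise from a cut at either end.
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


def shared_fragments(clone_sizes, reference_sizes, tolerance=0.02):
    """Count clone fragments matching a distinct reference fragment.

    Two fragments match when their sizes differ by at most `tolerance`
    (a hypothetical relative sizing error) of the reference size.
    """
    remaining = sorted(reference_sizes)
    matched = 0
    for size in sorted(clone_sizes):
        for i, ref in enumerate(remaining):
            if abs(size - ref) <= tolerance * ref:
                matched += 1
                del remaining[i]  # each reference fragment is used once
                break
    return matched
```

In such a scheme, a clone would be placed at the reference window whose computed fragment list shares the most fragments with the clone's measured fingerprint; clones whose fragments split between two distant windows would be candidates for rearrangements such as the t(14;18) translocation.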

Author(s):  
Karen H. Miga ◽  
Ting Wang

The reference human genome sequence is inarguably the most important and widely used resource in the fields of human genetics and genomics. It has transformed the conduct of biomedical sciences and brought invaluable benefits to the understanding and improvement of human health. However, the commonly used reference sequence has profound limitations, because across much of its span, it represents the sequence of just one human haplotype. This single, monoploid reference structure presents a critical barrier to representing the broad genomic diversity in the human population. In this review, we discuss the modernization of the reference human genome sequence to a more complete reference of human genomic diversity, known as a human pangenome. Expected final online publication date for the Annual Review of Genomics and Human Genetics, Volume 22 is August 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.


2017 ◽  
Author(s):  
Patrick Marks ◽  
Sarah Garcia ◽  
Alvaro Martinez Barrio ◽  
Kamila Belhocine ◽  
Jorge Bernate ◽  
...  

Abstract Large-scale population based analyses coupled with advances in technology have demonstrated that the human genome is more diverse than originally thought. To date, this diversity has largely been uncovered using short read whole genome sequencing. However, standard short-read approaches, used primarily due to accuracy, throughput and costs, fail to give a complete picture of a genome. They struggle to identify large, balanced structural events, cannot access repetitive regions of the genome and fail to resolve the human genome into its two haplotypes. Here we describe an approach that retains long range information while harnessing the advantages of short reads. Starting from only ∼1ng of DNA, we produce barcoded short read libraries. The use of novel informatic approaches allows for the barcoded short reads to be associated with the long molecules of origin producing a novel datatype known as ‘Linked-Reads’. This approach allows for simultaneous detection of small and large variants from a single Linked-Read library. We have previously demonstrated the utility of whole genome Linked-Reads (lrWGS) for performing diploid, de novo assembly of individual genomes (Weisenfeld et al. 2017). In this manuscript, we show the advantages of Linked-Reads over standard short read approaches for reference based analysis. We demonstrate the ability of Linked-Reads to reconstruct megabase scale haplotypes and to recover parts of the genome that are typically inaccessible to short reads, including phenotypically important genes such as STRC, SMN1 and SMN2. We demonstrate the ability of both lrWGS and Linked-Read Whole Exome Sequencing (lrWES) to identify complex structural variations, including balanced events, single exon deletions, and single exon duplications. The data presented here show that Linked-Reads provide a scalable approach for comprehensive genome analysis that is not possible using short reads alone.
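The idea of associating barcoded short reads with their long molecules of origin can be sketched as follows. This is an illustrative simplification, not the published informatic method: the function name, the input tuple layout, and the 50 kb gap threshold are assumptions. Reads sharing a barcode and chromosome are sorted by position and split into separate inferred molecules wherever the gap between neighbouring reads exceeds the threshold.

```python
from collections import defaultdict


def infer_molecules(reads, max_gap=50_000):
    """Group barcoded reads into inferred long molecules.

    `reads` is an iterable of (barcode, chrom, pos) tuples.
    `max_gap` is a hypothetical threshold: two reads with the same barcode
    that lie further apart than this are assumed to come from different
    input molecules.
    Returns a list of (barcode, chrom, start, end, n_reads) tuples.
    """
    by_key = defaultdict(list)
    for barcode, chrom, pos in reads:
        by_key[(barcode, chrom)].append(pos)

    molecules = []
    for (barcode, chrom), positions in sorted(by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large: close the current molecule, open a new one.
                molecules.append((barcode, chrom, start, prev, count))
                start, count = pos, 0
            prev = pos
            count += 1
        molecules.append((barcode, chrom, start, prev, count))
    return molecules
```

Variants observed on reads from the same inferred molecule lie on the same physical DNA fragment, which is the property that makes long-range haplotype reconstruction possible from short reads.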


2019 ◽  
Author(s):  
Gianpiero Marconi ◽  
Stefano Capomaccio ◽  
Cinzia Comino ◽  
Alberto Acquadro ◽  
Ezio Portis ◽  
...  

Abstract Methods for investigating DNA methylation nowadays either require a reference genome and high coverage, or investigate only CG methylation. Moreover, no large-scale analysis can be performed for N6-methyladenosine (6mA). Here we describe the methylation content sensitive enzyme double-digest restriction-site-associated DNA (ddRAD) technique (MCSeEd), a reduced-representation, reference-free, cost-effective approach for characterizing whole genome methylation patterns across different methylation contexts (e.g., CG, CHG, CHH, 6mA). MCSeEd can also detect genetic variations among hundreds of samples. MCSeEd is based on parallel restrictions carried out by combinations of methylation insensitive and sensitive endonucleases, followed by next-generation sequencing. Moreover, we present a robust bioinformatic pipeline (available at https://bitbucket.org/capemaster/mcseed/src/master/) for differential methylation analysis combined with single nucleotide polymorphism calling without or with a reference genome.


2020 ◽  
Author(s):  
Caroline Charre ◽  
Christophe Ginevra ◽  
Marina Sabatier ◽  
Hadrien Regue ◽  
Grégory Destras ◽  
...  

Abstract Since the beginning of the COVID-19 outbreak, SARS-CoV-2 whole-genome sequencing (WGS) has been performed at an unprecedented rate worldwide with the use of very diverse Next Generation Sequencing (NGS) methods. Herein, we compare the performance of four NGS-based approaches for SARS-CoV-2 WGS. Twenty-four clinical respiratory samples with a wide range of Ct values (from 10.7 to 33.9) were sequenced with four methods. Three used Illumina sequencing: an in-house metagenomic NGS (mNGS) protocol and two newly commercialized kits including a hybridization capture method developed by Illumina (DNA Prep with Enrichment kit and Respiratory Virus Oligo Panel, RVOP) and an amplicon sequencing method developed by Paragon Genomics (CleanPlex SARS-CoV-2 kit). We also evaluated the widely used amplicon sequencing protocol developed by ARTIC Network and combined with Oxford Nanopore Technologies (ONT) sequencing. All four methods yielded near-complete genomes (>99%) for samples with high viral loads, with mNGS and RVOP producing the most complete genomes. For mid viral loads, 2/8 and 1/8 genomes were incomplete (<99%) with mNGS and both CleanPlex and RVOP, respectively. For low viral loads (Ct ≥25), amplicon-based enrichment methods were the most sensitive techniques, yielding complete genomes for 7/8 samples. All methods were highly concordant in terms of identity in complete consensus sequence. Just one mismatch in two samples was observed in CleanPlex vs the other methods, due to the dedicated bioinformatics pipeline setting a high threshold for calling SNPs relative to the reference sequence. Importantly, all methods correctly identified a newly observed 34-nt deletion in ORF6 but required specific bioinformatic validation for RVOP. 
Finally, as a major warning for targeted techniques, a lack of coverage in any given region of the genome should alert to a potential rearrangement or a SNP in primer annealing or probe-hybridizing regions and would require regular updates of the technique according to SARS-CoV-2 evolution.


2020 ◽  
Vol 21 (1) ◽  
pp. 55-79 ◽  
Author(s):  
Daniel R. Zerbino ◽  
Adam Frankish ◽  
Paul Flicek

Our understanding of the human genome has continuously expanded since its draft publication in 2001. Over the years, novel assays have allowed us to progressively overlay layers of knowledge above the raw sequence of A's, T's, G's, and C's. The reference human genome sequence is now a complex knowledge base maintained under the shared stewardship of multiple specialist communities. Its complexity stems from the fact that it is simultaneously a template for transcription, a record of evolution, a vehicle for genetics, and a functional molecule. In short, the human genome serves as a frame of reference at the intersection of a diversity of scientific fields. In recent years, the progressive fall in sequencing costs has given increasing importance to the quality of the human reference genome, as hundreds of thousands of individuals are being sequenced yearly, often for clinical applications. Also, novel sequencing-based assays shed light on novel functions of the genome, especially with respect to gene expression regulation. Keeping the human genome annotation up to date and accurate is therefore an ongoing partnership between reference annotation projects and the greater community worldwide.


2020 ◽  
Vol 66 (1) ◽  
pp. 39-52
Author(s):  
Tomoya Tanjo ◽  
Yosuke Kawai ◽  
Katsushi Tokunaga ◽  
Osamu Ogasawara ◽  
Masao Nagasaki

Abstract Studies in human genetics deal with a plethora of human genome sequencing data that are generated from specimens as well as available in the public domain. With the development of various bioinformatics applications, maintaining the productivity of research, managing human genome data, and analyzing downstream data are essential. This review aims to guide struggling researchers to process and analyze these large-scale genomic data to extract relevant information for improved downstream analyses. Here, we discuss worldwide human genome projects that could be integrated into any data for improved analysis. Obtaining human whole-genome sequencing data from both data stores and processes is costly; therefore, we focus on the development of data formats and software that manipulate whole-genome sequencing data. Once the sequencing is complete and its format and data processing tools are selected, a computational platform is required. For the platform, we describe a multi-cloud strategy that balances cost, performance, and customizability. Good-quality published research relies on data reproducibility to ensure quality results, reusability for applications to other datasets, and scalability for the future increase of datasets. To address these needs, we describe several key technologies developed in computer science, including workflow engines. We also discuss the ethical guidelines required for human genomic data analysis, which differ from those for model organisms. Finally, the future ideal perspective of data processing and analysis is summarized.


2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Gianpiero Marconi ◽  
Stefano Capomaccio ◽  
Cinzia Comino ◽  
Alberto Acquadro ◽  
Ezio Portis ◽  
...  

Abstract Methods for investigating DNA methylation nowadays either require a reference genome and high coverage, or investigate only CG methylation. Moreover, no large-scale analysis can be performed for N6-methyladenosine (6mA) at an affordable price. Here we describe the methylation content sensitive enzyme double-digest restriction-site-associated DNA (ddRAD) technique (MCSeEd), a reduced-representation, reference-free, cost-effective approach for characterizing whole genome methylation patterns across different methylation contexts (e.g., CG, CHG, CHH, 6mA). MCSeEd can also detect genetic variations among hundreds of samples. MCSeEd is based on parallel restrictions carried out by combinations of methylation insensitive and sensitive endonucleases, followed by next-generation sequencing. Moreover, we present a robust bioinformatic pipeline (available at https://bitbucket.org/capemaster/mcseed/src/master/) for differential methylation analysis combined with single nucleotide polymorphism calling without or with a reference genome.


Blood ◽  
2018 ◽  
Vol 132 (Supplement 1) ◽  
pp. 2332-2332
Author(s):  
Yan Zheng ◽  
Ti-Cheng Chang ◽  
Gang Wu ◽  
Jane S. Hankins ◽  
Mitchell J. Weiss ◽  
...  

Abstract Introduction RBC alloimmunization is common in patients with sickle cell disease (SCD). Despite serological matching of RBCs for major Rh antigens, Rh alloimmunization remains problematic. The Rh blood group is encoded by two genes, RHD and RHCE, which exhibit extensive nucleotide polymorphism and chromosome structural changes, resulting in the formation of Rh variant antigens. Rh variants can result in loss of protein epitopes or expression of neo-epitopes, and are common in SCD patients. Hence SCD patients harboring Rh variants can be predisposed to Rh alloimmunization. Given the limitation of traditional serologic antigen typing for detection of Rh variants, molecular genotyping has become required. A DNA microarray-based platform, BioArray RHCE and RHD BeadChip (Immucor), is available for RH genotyping. However, it detects the most common, but not all, variants. Whole exome sequence data have been used for prediction of Rh variants (Chou et al., Blood Adv., 2017) and offer some advantages, including detection of rare variants, structural rearrangements and copy number variation. However, whole genome sequence (WGS) analysis of RHD/RHCE is challenging due to difficulties in mapping next generation sequencing (NGS) reads to this duplicated gene family. We developed a computational algorithm to identify RH variants using WGS data. Methods The pipeline included three major components: RH allele database construction, RH variant calling, and classification of Rh blood group according to the identified variants. The RH allele database was built based on the NCBI Blood Group Antigen Gene Mutation (BGMUT) and International Society of Blood Transfusion (ISBT) databases. Since the alleles in the BGMUT and ISBT databases were specified according to conventional RH genes (RHD, L08429; RHCE, DQ322275) that are different from those on the reference human genome, we first called the variations based on the reference human genome. 
The positions of the identified variations were subsequently corrected to match with the BGMUT and ISBT annotation system. Next, the NGS reads with low base quality and/or mapping quality were discarded during the variation calling step. Synonymous and non-synonymous amino acid changes were characterized for each polymorphism. Haplotypes were constructed for the segments with NGS read support. Gene sequencing coverage was calculated to determine gene deletions or amplifications. Lastly, we implemented an algorithm to predict RH genotypes based on a selection of candidate alleles by read-mapping profile which considers both sequence variations and sequence consistency followed by a likelihood-based ranking of all pairwise combinations of the selected alleles. The allele combination with the highest likelihood is considered the most likely pair of alleles at a given locus. Patient specimens used in this study were from participants of the Sickle Cell Clinical Research and Intervention Program (SCCRIP, Hankins et al. Pediatr Blood Cancer. 2018). Results We validated our method in a cohort of 58 SCD patients whose RH genotypes had been determined by BioArray RhCE and RhD BeadChip and supplementary molecular tests that identify the most common variants among individuals of African descent. In this validation cohort including a total of 11 RHD and 13 RHCE alleles, our approach achieved a concordance rate of 85.85% (91 of 106 alleles) for RHD and 83.02% (88 of 106 alleles) for RHCE genotyping. WGS was highly sensitive in distinguishing homozygosity from heterozygosity of genes. By comparing the numbers of NGS reads on RH regions and whole genome average coverage, heterozygous deletion can be determined. Since WGS provides comprehensive genotyping, our analysis identified single nucleotide polymorphisms that were not identified by the BeadChip and supplemental molecular testing. 
The final source of discordance was likely due to the short read length of NGS such that haplotype phases cannot be correctly predicted if the variations are separated by thousands of base pairs, for which long read DNA sequencing or RNA/cDNA sequencing are required. Evaluation of the identified discrepancies is ongoing. Conclusions We developed and validated a diagnostic method for RH genotyping that leveraged the accuracy and flexibility of RH genotyping based on WGS data. With further optimization of our method, this may be useful for RBC genotype matching sickle cell patients to blood donors in the future. Disclosures Hankins: Novartis: Research Funding; Global Blood Therapeutics: Research Funding; NCQA: Consultancy; bluebird bio: Consultancy.
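The coverage comparison mentioned in the Results (read depth over the RH regions versus the whole-genome average, used to detect heterozygous deletions) and the reported concordance rates can be illustrated with a short sketch. This is not the authors' pipeline: the function names, the diploid scaling, and the 0.25 tolerance around integer copy numbers are assumptions made for illustration.

```python
def estimate_copy_number(region_depth: float, genome_depth: float, ploidy: int = 2) -> float:
    """Scale the depth ratio by ploidy to get an approximate copy number."""
    if genome_depth <= 0:
        raise ValueError("genome_depth must be positive")
    return ploidy * region_depth / genome_depth


def classify_gene_dosage(region_depth: float, genome_depth: float, tolerance: float = 0.25) -> str:
    """Label a gene region by its estimated copy number.

    `tolerance` is a hypothetical cutoff: estimates further than this from
    the nearest integer are reported as ambiguous rather than forced.
    """
    copies = estimate_copy_number(region_depth, genome_depth)
    nearest = round(copies)
    if abs(copies - nearest) > tolerance:
        return "ambiguous"
    labels = {0: "homozygous deletion", 1: "heterozygous deletion", 2: "two copies"}
    return labels.get(nearest, "amplification")


def concordance_percent(concordant: int, total: int) -> float:
    """Percentage of alleles on which two genotyping methods agree."""
    return round(100.0 * concordant / total, 2)
```

With the figures quoted in the abstract, `concordance_percent(91, 106)` reproduces the 85.85% reported for RHD and `concordance_percent(88, 106)` the 83.02% reported for RHCE. A region sequenced at half the genome-wide depth, for example 15x against a 30x average, would be labelled a heterozygous deletion under these assumed thresholds.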


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
C. A. Samson ◽  
W. Whitford ◽  
R. G. Snell ◽  
J. C. Jacobsen ◽  
K. Lehnert

Abstract Cells obtained from human saliva are commonly used as an alternative DNA source when blood is difficult or less convenient to collect. Although DNA extracted from saliva is considered to be of comparable quality to that derived from blood, recent studies have shown that non-human contaminating DNA derived from saliva can confound whole genome sequencing results. The most concerning complication is that non-human reads align to the human reference genome using standard methodology, which can critically affect the resulting variant genotypes identified in a genome. We identified clusters of anomalous variants in saliva DNA derived reads which aligned in an atypical manner. These reads had only short regions of identity to the human reference sequence, flanked by soft clipped sequence. Sequence comparisons of atypically aligning reads from eight human saliva-derived samples to RefSeq genomes revealed the majority to be of bacterial origin (63.46%). To partition the non-human reads during the alignment step, a decoy of the most prevalent bacterial genome sequences was designed and utilised. This reduced the number of atypically aligning reads when trialled on the eight saliva-derived samples by 44% and most importantly prevented the associated anomalous genotype calls. Saliva derived DNA is often contaminated by DNA from other species. This can lead to non-human reads aligning to the human reference genome using current alignment best-practices, impacting variant identification. This problem can be diminished by using a bacterial decoy in the alignment process.
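The atypical alignments described above (a short region of identity to the human reference, flanked by soft-clipped sequence) can be recognised from the CIGAR string of each alignment. The sketch below is a hedged illustration rather than the authors' method: the function name and both thresholds (`max_aligned`, `min_clip`) are assumptions, and the abstract does not state the criteria actually used.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_atypical_alignment(cigar: str, max_aligned: int = 50, min_clip: int = 20) -> bool:
    """Flag a read whose short aligned core is flanked by soft clips on both ends.

    `max_aligned` and `min_clip` are hypothetical thresholds in bases.
    """
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Hard clips carry no read sequence, so ignore them when looking at the ends.
    core = [item for item in ops if item[1] != "H"]
    if len(core) < 3:
        return False
    left_clip = core[0][0] if core[0][1] == "S" else 0
    right_clip = core[-1][0] if core[-1][1] == "S" else 0
    aligned = sum(length for length, op in core if op in "M=X")
    return left_clip >= min_clip and right_clip >= min_clip and aligned <= max_aligned
```

Reads flagged this way could then be compared against bacterial genome sequences, or the alignment could be repeated with a bacterial decoy added to the reference so that such reads are partitioned away from the human sequence.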


Blood ◽  
2009 ◽  
Vol 114 (22) ◽  
pp. 439-439
Author(s):  
Andrew J. Mungall ◽  
Andy Chu ◽  
Readman Chiu ◽  
Richard Corbett ◽  
Matthew A. Field ◽  
...  

Abstract 439 Introduction: Follicular lymphoma (FL) is the most common indolent lymphoid malignancy in North America with approximately 20,000 new cases of this incurable cancer diagnosed each year. In approximately 85% of patients, FL is associated with the reciprocal translocation t(14;18)(q32;q21), which results in a fusion between IGH and BCL2 genes and consequent over-expression of the anti-apoptotic protein BCL2. This translocation likely represents an initiating event for FL, requiring additional mutational events for the onset of clinical disease. To investigate the relationship between genome rearrangements and FL we identified rearrangement locations in the genome followed by detailed, fine-structure analysis of the rearrangements to ascertain their effects on genes and other features of biological interest. Patients and Methods: We used a whole-genome bacterial artificial chromosome (BAC) fingerprint-based approach, termed Fingerprint Profiling (FPP, Krzywinski, M. et al. 2007), to detect genome rearrangements relative to the reference human genome in neoplastic B cells purified from 24 FL patient biopsies. Analysis of 2,640,707 BAC fingerprints revealed 721 candidate genomic rearrangements. To validate these observations and provide base-pair resolution of the rearrangement breakpoints we performed paired-end massively parallel sequencing, on the Illumina Genome Analyzer II platform, of the breakpoint-containing regions captured in the BAC clones. Sequence reads were assembled into contigs using our in-house de novo assembly algorithm ABySS (Assembly By Short Sequences, Simpson, J. et al. 2009) then aligned to the reference human genome. Following manual annotation of the breakpoint junctions PCR primers were designed to assay patient tumour and matched constitutional DNA and thus determine whether the observed genome rearrangements were somatic (acquired) or germline in origin. 
Results: 727 BACs with apparent large-scale genome rearrangements, representing 354 distinct genome rearrangements across 20 patients, were sequenced in 95 pools, generating 72 Gbp of sequence. The 354 distinct events include 163 deletions, 71 inversions, 27 insertions, 83 translocations and 10 duplications, ranging in size from 3 kb to 67 Mb. PCR assays for 194 of the distinct events have been performed thus far identifying 80 distinct somatic and 114 germline-derived structural variations at base-pair resolution. Of the somatic events 5 are present in two or more of the 20 patients analyzed including a 720 kb inversion of 3q27.3 that results in expression of a BCL6-ST6GAL1 fusion transcript. Identification at base-pair resolution of breakpoint sequences enabled a detailed study of breakpoint and fusion mechanisms. We classified breakpoint junctions into 4 groups; those with microhomology (48%), those with sequence additions (28%), those with blunt fusions (20%) and those with flanking low copy repeats (4%). We were particularly interested in establishing the origin of the observed nucleotide sequence additions in 97 breakpoint junctions. The sequence additions ranged in size from a single nucleotide to 454 bp. In one case we have unambiguously mapped a 53 bp sequence, lying within one of the 3q27.3 inversion breakpoints, to chromosome 5q12.3. This finding is consistent with the recently proposed fork stalling and template switching (FoSTeS) DNA replication-based mechanism and thus represents a novel mechanism in FL lymphomagenesis. Conclusions: We have successfully employed high-throughput clone fingerprinting and sequencing to identify numerous novel somatic and germline genome rearrangements from FL primary tumour samples. Furthermore, base-pair resolution of rearrangement breakpoints provides mechanistic insights. 
With the complete inventory of somatic and germline events in hand we will be able to propose recurrent structurally altered genes in FL patients for validation in independent datasets and improve our understanding of FL biology. Pathway analyses to identify emerging themes from somatic mutations are also being performed. The PCR assays we have developed will also be of utility in identifying germline predisposition alleles in larger FL patient cohorts. Disclosures: No relevant conflicts of interest to declare.

