Haplotype-resolved genomes of geminivirus-resistant and geminivirus-susceptible African cassava cultivars

Abstract Background: Cassava is an important food crop in tropical and sub-tropical regions worldwide. In Africa, cassava production is widely affected by cassava mosaic disease (CMD), which is caused by the African cassava mosaic geminivirus that is transmitted by whiteflies. Cassava breeders often use a single locus, CMD2, for introducing CMD resistance into susceptible cultivars. The CMD2 locus has been genetically mapped to a 10-Mbp region, but its organization and genes as well as their functions are unknown. Results: We report haplotype-resolved de novo assemblies and annotations of the genomes for the African cassava cultivar TME3 (TME, tropical Manihot esculenta), which is the origin of CMD2, and the CMD-susceptible cultivar 60444. The assemblies provide phased haplotype information for over 80% of the genomes. Haplotype comparison identified novel features previously hidden in collapsed and fragmented cassava genomes, including thousands of allelic variants, inter-haplotype diversity in coding regions, and patterns of diversification through allele-specific expression. Reconstruction of the CMD2 locus revealed a highly complex region with nearly identical gene sets but limited microsynteny between the two cultivars. Conclusions: The genome maps of the CMD2 locus in both 60444 and TME3, together with the newly annotated genes, will help identify the causal genetic basis of CMD2 resistance to geminiviruses. Our de novo cassava genome assemblies will also facilitate genetic mapping approaches to narrow the large CMD2 region to a few candidate genes for better informed strategies to develop robust geminivirus resistance in susceptible cassava cultivars.

Integrative genomic analysis unifying epigenetic inheritance in adaptation and canalization

10.1101/849620 ◽

2019 ◽

Author(s):

Abhay Sharma

Keyword(s):

De Novo ◽

Genomic Analysis ◽

Epigenetic Inheritance ◽

Recent Analysis ◽

Evolutionary Significance ◽

Evolutionary Adaptation ◽

Specific Expression ◽

Gene Sets ◽

Specific Finding ◽

Allele Specific

Abstract Epigenetic inheritance, especially its biomedical and evolutionary significance, is an immensely interesting but highly controversial subject. Notably, a recent analysis of existing multi-omics data has supported the mechanistic plausibility of epigenetic inheritance and its implications in disease and evolution. The evolutionary support stemmed from the specific finding that genes associated with cold-induced inheritance and with latitudinal adaptation in mice are exceptionally common. Here, a similar gene set overlap analysis is presented that integrates cold-induced inheritance with evolutionary adaptation and genetic canalization in cold environments in Drosophila. Genes showing differential expression in inheritance specifically overrepresent gene sets associated with differential and allele-specific expression, though not with genome-wide genetic differentiation, in adaptation. On the other hand, the differentiated outliers uniquely overrepresent genes dysregulated by radicicol, a decanalization inducer. Both gene sets in turn exclusively show enrichment of genes that accumulate, in intended experimental lines, de novo mutations, a potential source of canalization. Successively, the three gene sets distinctively overrepresent genes exhibiting, between mutation accumulation lines, invariable expression, a potential signal for canalization. Sequentially, the four gene sets solely display enrichment of genes grouped in gene ontology under transcription factor activity, a signature of regulatory canalization. Cumulatively, the analysis suggests that epigenetic inheritance possibly contributes to evolutionary adaptation in the form of cis-regulatory variations, with trans variations arising in the course of genetic canalization.
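A gene set overlap analysis of the kind described in this abstract is commonly scored with an upper-tail hypergeometric test. The sketch below is only an illustration under that assumption; the abstract does not state which test was used, and the function name, gene identifiers, and universe size are all hypothetical.

```python
from math import comb


def overlap_pvalue(universe: int, size_a: int, size_b: int, overlap: int) -> float:
    """Upper-tail hypergeometric probability P(X >= overlap).

    X is the number of genes shared by two sets of sizes size_a and size_b
    drawn at random from a background of `universe` genes.
    """
    max_shared = min(size_a, size_b)
    total = comb(universe, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, max_shared + 1)
    )
    return tail / total


# Toy example with made-up gene identifiers and a made-up background of 20 genes.
inheritance_genes = {"g1", "g2", "g3", "g4"}
adaptation_genes = {"g3", "g4", "g5"}
shared = inheritance_genes & adaptation_genes

p_value = overlap_pvalue(20, len(inheritance_genes), len(adaptation_genes), len(shared))
print(sorted(shared), p_value)
```

The choice of background (all annotated genes versus all expressed genes) strongly affects the result, so any real analysis would need to state it explicitly.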

The haplotype-resolved chromosome pairs and transcriptome of a heterozygous diploid African cassava cultivar

10.1101/2021.11.16.468774 ◽

2021 ◽

Author(s):

Weihong Qi ◽

Yi-Wen Lim ◽

Andrea Patrignani ◽

Pascal Schlaepfer ◽

Anna Bratus-Neuenschwander ◽

...

Keyword(s):

Expression Patterns ◽

Cost Effective ◽

Structural Variations ◽

Specific Expression ◽

Tropical Regions ◽

Reference Quality ◽

Allele Specific ◽

High Gene ◽

Genome Assemblies ◽

Genome Comparisons

Background: Cassava (Manihot esculenta) is an important clonally propagated food crop in tropical and sub-tropical regions worldwide. Genetic gain by molecular breeding is limited because cassava has a highly heterozygous, repetitive and difficult to assemble genome. Findings: Here we demonstrate that Pacific Biosciences high-fidelity (HiFi) sequencing reads, in combination with the assembler hifiasm, produced genome assemblies at near complete haplotype resolution with higher continuity and accuracy compared to conventional long sequencing reads. We present two chromosome scale haploid genomes phased with Hi-C technology for the diploid African cassava variety TME204. Genome comparisons revealed extensive chromosome re-arrangements and abundant intra-genomic and inter-genomic divergent sequences despite high gene synteny, with most large structural variations being LTR-retrotransposon related. Allele-specific expression analysis of different tissues based on the haplotype-resolved transcriptome identified both stable and inconsistent alleles with imbalanced expression patterns, while most alleles expressed coordinately. Among tissue-specific differentially expressed transcripts, coordinately and biasedly regulated transcripts were functionally enriched for different biological processes. We use the reference-quality assemblies to build a cassava pan-genome and demonstrate its importance in representing the genetic diversity of cassava for downstream reference-guided omics analysis and breeding. Conclusions: The haplotype-resolved genome allows the first systematic view of the heterozygous diploid genome organization in cassava. The completely phased and annotated chromosome pairs will be a valuable resource for cassava breeding and research. Our study may also provide insights into developing cost-effective and efficient strategies for resolving complex genomes with high resolution, accuracy and continuity.

Modeling allele-specific expression at the gene and SNP levels simultaneously by a Bayesian logistic mixed regression model

BMC Bioinformatics ◽

10.1186/s12859-019-3141-6 ◽

2019 ◽

Vol 20 (1) ◽

Author(s):

Jing Xie ◽

Tieming Ji ◽

Marco A. R. Ferreira ◽

Yahan Li ◽

Bhaumik N. Patel ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Mixed Model ◽

Linear Mixed Model ◽

De Novo ◽

Bovine Genome ◽

Whole Genome Analysis ◽

Specific Expression ◽

Allele Specific Expression ◽

Allele Specific

Abstract Background: High-throughput sequencing experiments, which can determine allele origins, have been used to assess genome-wide allele-specific expression. Despite the amount of data generated from high-throughput experiments, statistical methods are often too simplistic to capture the complexity of gene expression. Specifically, existing methods do not test allele-specific expression (ASE) of a gene as a whole and variation in ASE within a gene across exons separately and simultaneously. Results: We propose a generalized linear mixed model to close these gaps, incorporating variations due to genes, single nucleotide polymorphisms (SNPs), and biological replicates. To improve reliability of statistical inferences, we assign priors on each effect in the model so that information is shared across genes in the entire genome. We utilize Bayesian model selection to test the hypothesis of ASE for each gene and variations across SNPs within a gene. We apply our method to four tissue types in a bovine study to de novo detect ASE genes in the bovine genome, and uncover intriguing predictions of regulatory ASEs across gene exons and across tissue types. We compared our method to competing approaches through simulation studies that mimicked the real datasets. The R package, BLMRM, that implements our proposed algorithm, is publicly available for download at https://github.com/JingXieMIZZOU/BLMRM. Conclusions: We show that the proposed method exhibits improved control of the false discovery rate and improved power over existing methods when SNP variation and biological variation are present. In addition, our method maintains low computational requirements that allow for whole-genome analysis.

How well can we create phased, diploid, human genomes?: An assessment of FALCON-Unzip phasing using a human trio

10.1101/262196 ◽

2018 ◽

Cited By ~ 4

Author(s):

Arkarachai Fungtammasan ◽

Brett Hannigan

Keyword(s):

De Novo ◽

Personal Genome ◽

Specific Expression ◽

Human Genomes ◽

Future Improvement ◽

Long Read ◽

Allele Specific ◽

Personal Genomes ◽

Reference Genomes ◽

Haplotype Information

ABSTRACT Long read sequencing technology has allowed researchers to create de novo assemblies with impressive continuity[1,2]. This advancement has dramatically increased the number of reference genomes available and hints at the possibility of a future where personal genomes are assembled rather than resequenced. In 2016 Pacific Biosciences released the FALCON-Unzip framework, which can provide long, phased haplotype contigs from de novo assemblies. This phased genome algorithm enhances assembly accuracy for highly heterozygous organisms and allows researchers to explore questions that require haplotype information such as allele-specific expression and regulation. However, validation of this technique has been limited to small genomes or inbred individuals[3]. As a roadmap to personal genome assembly and phasing, we assess the phasing accuracy of FALCON-Unzip in humans using publicly available data for the Ashkenazi trio from the Genome in a Bottle Consortium[4]. To assess the accuracy of the Unzip algorithm, we assembled the genome of the son using FALCON and FALCON-Unzip, genotyped publicly available short read data for the mother and the father, and observed the inheritance pattern of the parental SNPs along the phased genome of the son. We found that 72.8% of haplotype contigs share SNPs with only one parent, suggesting that these contigs are correctly phased. Most mis-phased SNPs are random but present in high frequency toward the end of haplotype contigs. Approximately 20.7% of mis-phased haplotype contigs contain clusters of mis-phased SNPs, suggesting that haplotypes were mis-joined by FALCON-Unzip. Mis-joined boundaries in those contigs are located in areas of low SNP density. This research demonstrates that the FALCON-Unzip algorithm can be used to create long and accurate haplotypes for humans and identifies problematic regions that could benefit from future improvement.
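The trio-based check described in this abstract (does a haplotype contig share SNPs with only one parent?) can be illustrated with a small sketch. This is a simplified assumption of how such a classification might look, not the authors' pipeline; the function name, the `(position, allele)` representation, and the toy data are hypothetical.

```python
def classify_contig(contig_alleles: set, mother_only: set, father_only: set) -> str:
    """Label a haplotype contig by the parent whose private SNP alleles it carries.

    Each set holds (position, allele) tuples. `mother_only` and `father_only`
    contain alleles seen in exactly one parent, so they are informative for phasing.
    """
    maternal_hits = len(contig_alleles & mother_only)
    paternal_hits = len(contig_alleles & father_only)
    if maternal_hits and paternal_hits:
        return "mixed"          # candidate mis-phased or mis-joined contig
    if maternal_hits:
        return "maternal"
    if paternal_hits:
        return "paternal"
    return "uninformative"      # no parent-private SNPs on this contig


# Toy data: positions and alleles are invented for illustration.
mother_only = {(100, "A"), (250, "T")}
father_only = {(100, "G"), (400, "C")}

labels = [
    classify_contig({(100, "A"), (250, "T")}, mother_only, father_only),
    classify_contig({(100, "A"), (400, "C")}, mother_only, father_only),
    classify_contig({(900, "G")}, mother_only, father_only),
]
print(labels)
```

A real assessment would also need to tolerate genotyping errors, for example by requiring a minimum fraction of supporting SNPs rather than any single hit.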

Sex-specific exons control DNA methyltransferase in mammalian germ cells

Development ◽

10.1242/dev.125.5.889 ◽

1998 ◽

Vol 125 (5) ◽

pp. 889-897 ◽

Cited By ~ 2

Author(s):

C. Mertineit ◽

J.A. Yoder ◽

T. Taketo ◽

D.W. Laird ◽

J.M. Trasler ◽

...

Keyword(s):

Germ Cells ◽

Dna Methyltransferase ◽

De Novo ◽

Male Meiosis ◽

Specific Expression ◽

Oocyte Growth ◽

Methyl Transferase ◽

Developmental Potential ◽

Allele Specific ◽

Methylation Patterns

The spermatozoon and oocyte genomes bear sex-specific methylation patterns that are established during gametogenesis and are required for the allele-specific expression of imprinted genes in somatic tissues. The mRNA for Dnmt1, the predominant maintenance and de novo DNA (cytosine-5)-methyl transferase in mammals, is present at high levels in postmitotic murine germ cells but undergoes alternative splicing of sex-specific 5′ exons, which controls the production and localization of enzyme during specific stages of gametogenesis. An oocyte-specific 5′ exon is associated with the production of very large amounts of active Dnmt1 protein, which is truncated at the N terminus and sequestered in the cytoplasm during the later stages of oocyte growth, while a spermatocyte-specific 5′ exon interferes with translation and prevents production of Dnmt1 during the prolonged crossing-over stage of male meiosis. During the course of postnatal oogenesis, Dnmt1 is present at high levels in nuclei only in growing dictyate oocytes, a stage during which gynogenetic developmental potential is lost and biparental developmental potential is gained.

Allele-Specific Expression of GATA2 in AML with CEBPA Biallelic Mutations

Blood ◽

10.1182/blood-2019-129766 ◽

2019 ◽

Vol 134 (Supplement_1) ◽

pp. 1235-1235

Author(s):

Roger Mulet-Lazaro ◽

Stanley van Herk ◽

Claudia Erpelinck-Verschueren ◽

Mathijs A. Sanders ◽

Eric Bindels ◽

...

Keyword(s):

Dna Methylation ◽

De Novo ◽

Regulatory Elements ◽

P Value ◽

Specific Expression ◽

Allele Specific ◽

De Novo Aml ◽

Cebpa Mutations ◽

Biallelic Mutations ◽

In Cis

Introduction: Transcriptional deregulation is a central event in the development of acute myeloid leukemia (AML), with most mutations occurring in genes related to transcription, chromatin regulation and DNA methylation. Furthermore, alterations involving cis-regulatory elements have been shown to play a critical role in aberrant gene expression in AML. Genetic variation in cis-regulatory regions usually involves a single allele, which results in differential expression of the two alleles. This phenomenon, termed allele-specific expression (ASE), is therefore an accurate marker for cis-regulatory variation (Pastinen, 2010). We propose that a systematic study of genes with aberrant ASE in AML may uncover aberrantly expressed genes caused by abnormalities in cis-regulatory elements. Therefore we aim to 1) chart the landscape of ASE in AML, 2) establish a link between relevant ASE events and AML subtypes, and 3) investigate the mechanisms driving ASE. Methods: We performed whole exome sequencing (WES) and RNA-seq on leukemic blasts from 168 de novo AML patients, representing all major subtypes of the disease. Combining both datasets, we assessed ASE in every gene with informative (non-homozygous) single nucleotide variants (SNVs). Results: Patients had a median of 37 genes with ASE, several of which were recurrently detected across multiple patients. To shorten the gene list, we selected for this study genes known to be involved either in cancer or in myeloid development. The gene most commonly found to show ASE (53/140 cases with SNVs) was GATA2, which encodes a transcription factor crucial for proliferation and maintenance of hematopoietic stem cells with a known involvement in AML. Interestingly, integration with molecularly defined classification of AML revealed that all cases (n=17) with biallelic CEBPA mutations exhibited GATA2 ASE (p-value = 6.00 × 10^-7, Fisher's test). 
Biallelic CEBPA mutations (CEBPA DM) identify an AML subtype with favorable clinical outcome and frequently co-occur with GATA2 mutations (Greif PA, 2012), pointing to a functional connection between these two genes. Indeed, 44% of the cases in our cohort exhibited a GATA2 mutation, and 27% carried a second, subclonal mutation in the same gene. Importantly, in cases where a GATA2 mutation was found, the mutant allele was always preferentially expressed. These findings were validated in the TCGA dataset, where all four CEBPA DM patients with informative SNVs in GATA2 exhibited GATA2 ASE. Although GATA2 ASE was present in other AML subtypes, none of these subtypes showed a significant association with this finding. Patients with a t(8;21) rearrangement (n=5), which represses CEBPA expression, did not exhibit GATA2 ASE, and we only observed GATA2 ASE in 4 out of 8 CEBPA-silenced leukemias (Wouters BJ, 2007). Altogether, this demonstrates the uniqueness of the 1-to-1 relationship between CEBPA DM and GATA2 ASE, and excludes a causative role for inactive CEBPA protein in mediating mono-allelic expression of GATA2. The average expression of GATA2 in CEBPA DM patients was comparable to other AMLs, even in cases with monoallelic GATA2 expression. This suggests that a) ASE was achieved by repression of one allele rather than dramatically increased expression of the other, and b) there was compensation by the non-repressed allele. DNA methylation analysis of the GATA2 promoter did not reveal methylation-mediated gene silencing of the repressed allele. The long-distance +77 kb GATA2 enhancer appears to be involved in ASE, as RNA read-through levels at the enhancer were significantly different in CEBPA DM AMLs (p-value < 10^-4, Wald test) in an allele-specific manner. The involvement of the enhancer was further confirmed by differences in H3K27ac levels between the two alleles. 
Conclusions An unbiased screen of 168 de novo AML cases revealed that all patients (n=17) with CEBPA biallelic mutations display GATA2 ASE. GATA2 mutations were found in 8 of the 17 cases, always in the allele that is preferentially expressed. Since GATA2 ASE is present in all CEBPA DM and GATA2 mutations only in a fraction, we hypothesize that GATA2 ASE is acquired first and mutations are only selected if they occur in the expressed allele. Moreover, given that other subgroups with CEBPA abnormalities do not show a similar pattern, we propose that ASE of GATA2 is not a consequence of CEBPA mutations, but rather a requirement for the development of AML in these patients. Disclosures No relevant conflicts of interest to declare.
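The association between CEBPA DM status and GATA2 ASE is reported with Fisher's test. The sketch below shows how a one-sided Fisher's exact test on a 2×2 table can be computed. The table is an assumption built from counts in the abstract (17 of 17 CEBPA DM cases with ASE, 53 of 140 informative cases with ASE overall) and further assumes that all 17 CEBPA DM cases are among the 140. The abstract does not give the actual table or the sidedness of the test, so the result is not expected to match the reported p-value. The function name is hypothetical.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins,
    i.e. the probability of at least `a` counts in the top-left cell.
    """
    row1 = a + b          # e.g. CEBPA DM cases
    col1 = a + c          # e.g. cases with GATA2 ASE
    n = a + b + c + d     # all informative cases
    upper = min(row1, col1)
    tail = sum(
        comb(col1, k) * comb(n - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n, row1)


# ASSUMED table (see lead-in): rows = CEBPA DM / other, columns = ASE / no ASE.
cebpa_dm_ase, cebpa_dm_no_ase = 17, 0
other_ase, other_no_ase = 53 - 17, 140 - 53

p_value = fisher_one_sided(cebpa_dm_ase, cebpa_dm_no_ase, other_ase, other_no_ase)
print(f"one-sided p = {p_value:.3g}")
```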

Phased diploid genome assemblies and pan-genomes provide insights into the genetic history of apple domestication

Nature Genetics ◽

10.1038/s41588-020-00723-9 ◽

2020 ◽

Vol 52 (12) ◽

pp. 1423-1432 ◽

Cited By ~ 2

Author(s):

Xuepeng Sun ◽

Chen Jiao ◽

Heidi Schwaninger ◽

C. Thomas Chao ◽

Yumin Ma ◽

...

Keyword(s):

Developmental Stages ◽

Hybrid Origin ◽

Selective Sweeps ◽

Specific Expression ◽

Genetic History ◽

New Genes ◽

History Of ◽

Allele Specific ◽

Genome Assemblies ◽

Wild Progenitors

Abstract Domestication of the apple was mainly driven by interspecific hybridization. In the present study, we report the haplotype-resolved genomes of the cultivated apple (Malus domestica cv. Gala) and its two major wild progenitors, M. sieversii and M. sylvestris. Substantial variations are identified between the two haplotypes of each genome. Inference of genome ancestry identifies ~23% of the Gala genome as of hybrid origin. Deep sequencing of 91 accessions identifies selective sweeps in cultivated apples that originated from either of the two progenitors and are associated with important domestication traits. Construction and analyses of apple pan-genomes uncover thousands of new genes, with hundreds of them being selected from one of the progenitors and largely fixed in cultivated apples, revealing that introgression of new genes/alleles is a hallmark of apple domestication through hybridization. Finally, transcriptome profiles of Gala fruits at 13 developmental stages unravel ~19% of genes displaying allele-specific expression, including many associated with fruit quality.

RH mapping by sequencing: chromosome-scale assembly of the duck genome

10.1101/846840 ◽

2019 ◽

Author(s):

Man Rao ◽

Alain Vignal ◽

Mireille Morisson ◽

Valérie Fillon ◽

Sophie Leroux ◽

...

Keyword(s):

De Novo ◽

Chromosomal Rearrangements ◽

Genotyping By Sequencing ◽

Critical Issue ◽

Chromosomal Breakage ◽

Genome Maps ◽

Comparative Maps ◽

Mapping Process ◽

Rh Panel ◽

Genome Assemblies

Abstract As for many other species, the duck genome has been sequenced thanks to the technological breakthrough provided by the emergence of Next Generation Sequencing (NGS). The resulting de novo assemblies are, however, made of thousands of scattered scaffolds. To achieve chromosome-scale contiguity, long-range intermediate genome maps remain indispensable. Radiation Hybrid (RH) maps have been used to assist the generation of chromosome-scale genome assemblies by taking advantage of the high-density SNP chips that provide a large number of markers that can be efficiently genotyped on the panel. In the absence of such a resource in duck, we sequenced 100 hybrid clones of a duck RH panel, enabling direct genotyping of the assembly scaffolds on the panel. The rationale is to use scaffolds as markers and to genotype the scaffolds by sequencing the clones: the presence/absence of a scaffold in a particular sequenced hybrid is attested by the presence/absence of reads mapping specifically to this scaffold. The detection of scaffolds exhibiting a chromosomal breakage resulting from the irradiation process proved to be a critical issue of this genotyping-by-sequencing process. This process resulted in the construction of RH vectors for 2,027 scaffolds, representing a total of about 1 Gb of sequences (95% of the current duck genome assembly). The subsequent linkage analysis enabled the construction of RH maps and thereby the organization, i.e. ordering and orientation, of the scaffolds into pseudomolecules associated with the corresponding duck chromosomes. We describe here the whole mapping process, from sequence-based genotyping to the construction of comparative maps, as well as a few examples of intra-chromosomal rearrangements that have been identified by comparison with the chicken, turkey and zebra finch genomes and subsequently confirmed by FISH. We describe a method to order and orient sequence scaffolds into super-scaffolds spanning entire chromosomes. 
The method, which requires a pre-existing RH panel and sequence scaffolds from an NGS assembly, relies on a shallow sequencing of the RH clones. This approach was applied to the duck genome and produced chromosome-scale scaffolds for 29 out of the 41 duck chromosomes.
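The genotyping-by-sequencing idea in this abstract (a scaffold is scored present or absent in each hybrid clone according to the reads mapping to it) can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names, the reads-per-kb threshold, and the read counts are hypothetical, and breakage detection within scaffolds is not modeled.

```python
def rh_vector(read_counts: list, scaffold_length_bp: int, min_reads_per_kb: float = 0.5) -> list:
    """Build a presence/absence (1/0) vector for one scaffold across RH clones.

    read_counts[i] is the number of reads from clone i mapping specifically
    to the scaffold. The density threshold is an illustrative assumption.
    """
    length_kb = scaffold_length_bp / 1000
    return [1 if count / length_kb >= min_reads_per_kb else 0 for count in read_counts]


def discordance(vector_a: list, vector_b: list) -> float:
    """Fraction of clones in which two scaffolds differ in presence/absence.

    Low discordance suggests the scaffolds lie close together on a chromosome.
    """
    differing = sum(a != b for a, b in zip(vector_a, vector_b))
    return differing / len(vector_a)


# Toy example with four clones and invented read counts.
scaffold_1 = rh_vector([0, 50, 3, 120], scaffold_length_bp=100_000)
scaffold_2 = rh_vector([2, 80, 95, 140], scaffold_length_bp=100_000)
print(scaffold_1, scaffold_2, discordance(scaffold_1, scaffold_2))
```

In practice, RH mapping software estimates distances from retention and breakage probabilities rather than raw discordance; the simple fraction is used here only to show the data structure.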

Haplotype-Resolved Cattle Genomes Provide Insights Into Structural Variation and Adaptation

10.1101/720797 ◽

2019 ◽

Cited By ~ 1

Author(s):

Wai Yee Low ◽

Rick Tearle ◽

Ruijie Liu ◽

Sergey Koren ◽

Arang Rhie ◽

...

Keyword(s):

Copy Number Variants ◽

Fatty Acid Desaturase ◽

Gene Families ◽

Specific Reference ◽

Single Individual ◽

Extra Copy ◽

Specific Expression ◽

Genome Wide ◽

Allele Specific ◽

Genome Assemblies

Abstract We present high-quality, phased genome assemblies representative of taurine and indicine cattle, subspecies that differ markedly in productivity-related traits and environmental adaptation. We report a new haplotype-aware scaffolding and polishing pipeline using contigs generated by the trio binning method to produce haplotype-resolved, chromosome-level genome assemblies of Angus (taurine) and Brahman (indicine) cattle breeds. These assemblies were used to identify structural and copy number variants that differentiate the subspecies, and we found that variant detection was sensitive to the specific reference genome chosen. Six gene families with immune-related functions are expanded in the indicine lineage. Assembly of the genomes of both subspecies from a single individual enabled transcripts to be phased to detect allele-specific expression, and to study genome-wide selective sweeps. An indicus-specific extra copy of fatty acid desaturase is under positive selection and may contribute to indicine adaptation to heat and drought.

Integrative analysis of rare variants and pathway information shows convergent results between immune pathways, drug targets and epilepsy genes

10.1101/410100 ◽

2018 ◽

Author(s):

Hoang T. Nguyen ◽

Amanda Dobbyn ◽

Alexander W. Charney ◽

Julien Bryois ◽

April Kim ◽

...

Keyword(s):

Complex Disease ◽

De Novo ◽

Case Control ◽

Next Generation Sequencing Data ◽

Individual Risk ◽

Sequencing Data ◽

Specific Expression ◽

Risk Genes ◽

Gene Set ◽

Gene Sets

Abstract Trio family and case-control studies of next-generation sequencing data have proven integral to understanding the contribution of rare inherited and de novo single-nucleotide variants to the genetic architecture of complex disease. Ideally, such studies should identify individual risk genes of moderate to large effect size to generate novel treatment hypotheses for further follow-up. However, due to insufficient power, gene set enrichment analyses have come to be relied upon for detecting differences between cases and controls, implicating sets of hundreds of genes rather than specific targets for further investigation. Here, we present a Bayesian statistical framework, termed gTADA, that integrates gene-set membership information with gene-level de novo and rare inherited case-control counts, to prioritize risk genes with excess rare variant burden within enriched gene sets. Applying gTADA to available whole-exome sequencing datasets for several neuropsychiatric conditions, we replicated previously reported gene set enrichments and identified novel risk genes. For epilepsy, gTADA prioritized 40 risk genes (posterior probabilities > 0.95), 6 of which replicate in an independent whole-genome sequencing study. In addition, 30 of the 40 genes are novel. We found that epilepsy genes had high protein-protein interaction (PPI) network connectivity and showed specific expression during human brain development. Some of the top prioritized EPI genes were connected to a PPI subnetwork of immune genes and showed specific expression in prenatal microglia. We also identified multiple enriched drug-target gene sets for EPI, which included immunostimulants as well as known antiepileptics. Immune biology was supported specifically by case-control variants from familial epilepsies rather than de novo mutations in generalized encephalitic epilepsy.