Dense and accurate whole-chromosome haplotyping of individual genomes

Abstract The diploid nature of the genome is neglected in many analyses done today, where a genome is perceived as a set of unphased variants with respect to a reference genome. Many important biological phenomena such as compound heterozygosity and epistatic effects between enhancers and target genes, however, can only be studied when haplotype-resolved genomes are available. This lack of haplotype-level analyses can be explained by a dearth of methods to produce dense and accurate chromosome-length haplotypes at reasonable costs. Here we introduce an integrative phasing strategy that combines global, but sparse haplotypes obtained from strand-specific single cell sequencing (Strand-seq) with dense, yet local, haplotype information available through long-read or linked-read sequencing. Our experiments provide comprehensive guidance on favorable combinations of Strand-seq libraries and sequencing coverages to obtain complete and genome-wide haplotypes of a single individual genome (NA12878) at manageable costs. We were able to reliably assign >95% of alleles to their parental haplotypes using as few as 10 Strand-seq libraries in combination with 10-fold coverage PacBio data or, alternatively, 10X Genomics linked-read sequencing data. We conclude that the combination of Strand-seq with different sequencing technologies represents an attractive solution to chart the unique genetic variation of diploid genomes.


Next generation sequencing allows deeper analysis and understanding of genomes and transcriptomes including aspects to fertility

Reproduction Fertility and Development ◽

10.1071/rd10247 ◽

2011 ◽

Vol 23 (1) ◽

pp. 75 ◽

Cited By ~ 7

Author(s):

Thomas Werner

Keyword(s):

Next Generation Sequencing ◽

Transcriptional Control ◽

Target Genes ◽

De Novo ◽

Alternative Promoters ◽

Next Generation ◽

Sequencing Data ◽

Genome Wide ◽

A Genome ◽

Generation Sequencing

Reproduction and fertility are controlled by specific events naturally linked to oocytes, testes and early embryonal tissues. A significant part of these events involves gene expression, especially transcriptional control and alternative transcription (alternative promoters and alternative splicing). While methods to analyse such events for carefully predetermined target genes are well established, until recently no methodology existed to extend such analyses into a genome-wide de novo discovery process. With the arrival of next generation sequencing (NGS) it becomes possible to attempt genome-wide discovery in genomic sequences as well as whole transcriptomes at a single nucleotide level. This does not only allow identification of the primary changes (e.g. alternative transcripts) but also helps to elucidate the regulatory context that leads to the induction of transcriptional changes. This review discusses the basics of the new technological and scientific concepts arising from NGS, prominent differences from microarray-based approaches and several aspects of its application to reproduction and fertility research. These concepts will then be illustrated in an application example of NGS sequencing data analysis involving postimplantation endometrium tissue from cows.


DNAModAnnot: a R toolbox for DNA modification filtering and annotation

Bioinformatics ◽

10.1093/bioinformatics/btab032 ◽

2021 ◽

Author(s):

Alexis Hardy ◽

Mélody Matelot ◽

Amandine Touzeau ◽

Christophe Klopp ◽

Céline Lopez-Roques ◽

...

Keyword(s):

Global Analysis ◽

R Package ◽

Supplementary Information ◽

Dna Modification ◽

Paramecium Tetraurelia ◽

Sequencing Data ◽

Genome Wide ◽

A Genome ◽

Dna Modifications ◽

Long Read

Abstract Motivation: Long-read sequencing technologies can be employed to detect and map DNA modifications at nucleotide resolution on a genome-wide scale. However, published software packages neglect the integration of genomic annotation and comprehensive filtering when analyzing patterns of modified bases detected using Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT) data. Here, we present DNAModAnnot, an R package designed for the global analysis of DNA modification patterns using adapted filtering and visualization tools. Results: We tested our package using PacBio sequencing data to analyze patterns of 6-methyladenine (6mA) in the ciliate Paramecium tetraurelia, in which high 6mA amounts were previously reported. We found the genome-wide distribution of 6mA in Paramecium tetraurelia to be similar to that of other ciliates. We also performed 5-methylcytosine (5mC) analysis in human lymphoblastoid cells using ONT data and confirmed previously known patterns of 5mC. DNAModAnnot provides a toolbox for the genome-wide analysis of different DNA modifications using PacBio and ONT long-read sequencing data. Availability: DNAModAnnot is distributed as an R package available via GitHub (https://github.com/AlexisHardy/DNAModAnnot). Supplementary information: Supplementary data are available at Bioinformatics online.


Genome-wide survey of tandem repeats by nanopore sequencing shows that disease-associated repeats are more polymorphic in the general population

BMC Medical Genomics ◽

10.1186/s12920-020-00853-3 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Satomi Mitsuhashi ◽

Martin C. Frith ◽

Naomichi Matsumoto

Keyword(s):

General Population ◽

Tandem Repeats ◽

Repeat Unit ◽

Mendelian Disease ◽

Length Variation ◽

Sequencing Data ◽

Genome Wide ◽

A Genome ◽

Long Read ◽

Genome Wide Survey

Abstract Background: Tandem repeats are highly mutable and contribute to the development of human disease by a variety of mechanisms. It is difficult to predict which tandem repeats may cause a disease. One hypothesis is that changeable tandem repeats are the source of genetic diseases, because disease-causing repeats are polymorphic in healthy individuals. However, it is not clear whether disease-causing repeats are more polymorphic than other repeats. Methods: We performed a genome-wide survey of the millions of human tandem repeats using publicly available long-read genome sequencing data from 21 humans. We measured tandem repeat copy number changes using tandem-genotypes. Length variation of known disease-associated repeats was compared to other repeat loci. Results: We found that known Mendelian disease-causing or disease-associated repeats, especially CAG and 5′UTR GGC repeats, are relatively long and polymorphic in the general population. We also show that repeat lengths of two disease-causing tandem repeats, in ATXN3 and GLS, are correlated with nearby GWAS SNP genotypes. Conclusions: We provide a catalog of polymorphic tandem repeats across a variety of repeat unit lengths and sequences, from long-read sequencing data. This method, especially if used in genome-wide association studies, may indicate possible new candidates of pathogenic or biologically important tandem repeats in human genomes.


Genome-wide Survey of Tandem Repeats by Nanopore Sequencing Shows that Disease-associated Repeats are More Polymorphic in the General Population

10.21203/rs.3.rs-79348/v1 ◽

2020 ◽

Author(s):

Satomi Mitsuhashi ◽

Martin C Frith ◽

Naomichi Matsumoto

Keyword(s):

General Population ◽

Tandem Repeats ◽

Repeat Unit ◽

Mendelian Disease ◽

Length Variation ◽

Sequencing Data ◽

Genome Wide ◽

A Genome ◽

Long Read ◽

Genome Wide Survey

Abstract Background: Tandem repeats are highly mutable and contribute to the development of human disease by a variety of mechanisms. It is difficult to predict which tandem repeats may cause a disease. One hypothesis is that changeable tandem repeats are the source of genetic diseases, because disease-causing repeats are polymorphic in healthy individuals. However, it is not clear whether disease-causing repeats are more polymorphic than other repeats. Methods: We performed a genome-wide survey of the millions of human tandem repeats using publicly available long-read genome sequencing data from 21 humans. We measured tandem repeat copy number changes using tandem-genotypes. Length variation of known disease-associated repeats was compared to other repeat loci. Results: We found that known Mendelian disease-causing or disease-associated repeats, especially CAG and 5'UTR GGC repeats, are relatively long and polymorphic in the general population. We also show that repeat lengths of two disease-causing tandem repeats, in ATXN3 and GLS, are correlated with nearby GWAS SNP genotypes. Conclusions: We provide a catalog of polymorphic tandem repeats across a variety of repeat unit lengths and sequences, from long-read sequencing data. This method, especially if used in genome-wide association studies (GWAS), may indicate possible new candidates of pathogenic or biologically important tandem repeats in human genomes.


Genome-wide survey of tandem repeats by nanopore sequencing shows that disease-associated repeats are more polymorphic in the general population

10.1101/2019.12.19.883389 ◽

2019 ◽

Author(s):

Satomi Mitsuhashi ◽

Martin C Frith ◽

Naomichi Matsumoto

Keyword(s):

General Population ◽

Tandem Repeats ◽

Mendelian Disease ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Human Genomes ◽

Genome Wide ◽

A Genome ◽

Long Read ◽

Genome Wide Survey

Abstract Tandem repeats are highly mutable and contribute to the development of human disease by a variety of mechanisms. However, it is difficult to predict which tandem repeats may cause a disease. We performed a genome-wide survey of the millions of human tandem repeats using long-read genome sequencing data from 16 humans. We found that known Mendelian disease-causing or disease-associated repeats, especially coding CAG and 5’UTR GGC repeats, are relatively long and polymorphic in the general population. This method, especially if used in GWAS, may indicate possible new candidates of pathogenic or biologically important tandem repeats in human genomes.


Quinoa genome assembly employing genomic variation for guided scaffolding

Theoretical and Applied Genetics ◽

10.1007/s00122-021-03915-x ◽

2021 ◽

Author(s):

Alexandrina Bodrug-Schepers ◽

Nancy Stralis-Pavese ◽

Hermann Buerstmayr ◽

Juliane C. Dohm ◽

Heinz Himmelbauer

Keyword(s):

Genome Assembly ◽

Chenopodium Quinoa ◽

Genomic Variation ◽

Valuable Resource ◽

Sequencing Data ◽

Genome Wide ◽

A Genome ◽

Long Read ◽

Genome Assemblies ◽

Haplotype Information

Key message: We propose to use the natural variation between individuals of a population for genome assembly scaffolding. In today’s genome projects, multiple accessions get sequenced, leading to variant catalogs. Using such information to improve genome assemblies is attractive both cost-wise as well as scientifically, because the value of an assembly increases with its contiguity. We conclude that haplotype information is a valuable resource to group and order contigs toward the generation of pseudomolecules. Abstract: Quinoa (Chenopodium quinoa) has been under cultivation in Latin America for more than 7500 years. Recently, quinoa has gained increasing attention due to its stress resistance and its nutritional value. We generated a novel quinoa genome assembly for the Bolivian accession CHEN125 using PacBio long-read sequencing data (assembly size 1.32 Gbp, initial N50 size 608 kbp). Next, we re-sequenced 50 quinoa accessions from Peru and Bolivia. This set of accessions differed at 4.4 million single-nucleotide variant (SNV) positions compared to CHEN125 (1.4 million SNV positions on average per accession). We show how to exploit variation in accessions that are distantly related to establish a genome-wide ordered set of contigs for guided scaffolding of a reference assembly. The method is based on detecting shared haplotypes and their expected continuity throughout the genome (i.e., the effect of linkage disequilibrium), as an extension of what is expected in mapping populations where only a few haplotypes are present. We test the approach using Arabidopsis thaliana data from different populations. After applying the method on our CHEN125 quinoa assembly we validated the results with mate-pairs, genetic markers, and another quinoa assembly originating from a Chilean cultivar. We show consistency between these information sources and the haplotype-based relations as determined by us and obtain an improved assembly with an N50 size of 1079 kbp and ordered contig groups of up to 39.7 Mbp. We conclude that haplotype information in distantly related individuals of the same species is a valuable resource to group and order contigs according to their adjacency in the genome toward the generation of pseudomolecules.
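
Note on the N50 figures quoted above (608 kbp initially, 1079 kbp after scaffolding): N50 is conventionally the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The sketch below illustrates that conventional definition only; the `n50` helper and the toy contig lengths are hypothetical and are not taken from the paper or its software.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running sum first reaches half of the
    total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Toy contig lengths (arbitrary units), invented for illustration.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Merging contigs into longer scaffolds moves more of the total length into fewer, longer sequences, which is why guided scaffolding can raise N50 without changing the assembly size.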


The mutL Gene as a Genome-Wide Taxonomic Marker for High Resolution Discrimination of Lactiplantibacillus plantarum and Its Closely Related Taxa

Microorganisms ◽

10.3390/microorganisms9081570 ◽

2021 ◽

Vol 9 (8) ◽

pp. 1570

Author(s):

Chien-Hsun Huang ◽

Chih-Chieh Chen ◽

Yu-Chun Lin ◽

Chia-Hsuan Chen ◽

Ai-Yun Lee ◽

...

Keyword(s):

16S Rrna ◽

16S Rrna Gene ◽

Target Genes ◽

Marker Genes ◽

Rrna Gene ◽

Accurate Identification ◽

Discrimination Power ◽

Sequence Identity ◽

Genome Wide ◽

A Genome

The current taxonomy of the Lactiplantibacillus plantarum group comprises 17 closely related species that are indistinguishable from each other by commonly used 16S rRNA gene sequencing. In this study, a whole-genome-based analysis was carried out to explore highly discriminating target genes whose interspecific sequence identity is significantly lower than that of 16S rRNA or conventional housekeeping genes. In silico analyses of 774 core genes by the cano-wgMLST_BacCompare analytics platform indicated that csbB, morA, murI, mutL, ntpJ, rutB, trmK, ydaF, and yhhX genes were the most promising candidates. Subsequently, the mutL gene was selected, and the discrimination power was further evaluated using Sanger sequencing. Among the type strains, mutL exhibited a clearly lower, and hence more discriminating, sequence identity (61.6–85.6%; average: 66.6%) than the 16S rRNA gene (96.7–100%; average: 98.4%) and the conventional phylogenetic marker genes (e.g., dnaJ, dnaK, pheS, recA, and rpoA), and could be used to separate the tested strains into species clusters. Consequently, species-specific primers were developed for fast and accurate identification of L. pentosus, L. argentoratensis, L. plantarum, and L. paraplantarum. During this study, one strain (BCRC 06B0048, L. pentosus) exhibited not only relatively low mutL sequence identities (97.0%) but also a low digital DNA–DNA hybridization value (78.1%) with the type strain DSM 20314T, signifying that it exhibits potential for reclassification as a novel subspecies. Our data demonstrate that mutL can be a genome-wide target for identifying and classifying the L. plantarum group species and for differentiating novel taxa from known species.
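
Note on the sequence identity percentages quoted above: they compare marker-gene sequences between strains. As a rough illustration of the idea only, the sketch below computes percent identity for two sequences that are assumed to be already aligned to equal length. The `percent_identity` helper, the choice to skip columns that are gaps in both sequences, and the toy sequences are all assumptions made for this example; the abstract does not state which alignment tool or identity definition the authors used.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity between two pre-aligned, equal-length sequences.

    Columns where both sequences carry a gap ('-') are skipped; a gap
    opposite a base counts as a mismatch. Other conventions exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" and y == "-":
            continue  # uninformative column
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable alignment columns")
    return 100.0 * matches / compared


# Toy aligned fragments, invented for illustration (5 of 6 columns match).
print(round(percent_identity("ACGT-A", "ACGTTA"), 1))
```

A marker whose identity between species is far below 100% leaves more differing positions to tell species apart, which is the sense in which a low-identity gene is more discriminating than 16S rRNA.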


Drosophila Gain-of-Function Mutant RTK Torso Triggers Ectopic Dpp and STAT Signaling

Genetics ◽

10.1093/genetics/164.1.247 ◽

2003 ◽

Vol 164 (1) ◽

pp. 247-258 ◽

Cited By ~ 1

Author(s):

Jinghong Li ◽

Willis X Li

Keyword(s):

Tyrosine Kinases ◽

Target Genes ◽

Tor Signaling ◽

Gain Of Function ◽

Essential Requirement ◽

Downstream Target ◽

Rtk Signaling ◽

Genome Wide ◽

A Genome ◽

Genomic Regions

Abstract Overactivation of receptor tyrosine kinases (RTKs) has been linked to tumorigenesis. To understand how a hyperactivated RTK functions differently from wild-type RTK, we conducted a genome-wide systematic survey for genes that are required for signaling by a gain-of-function mutant Drosophila RTK Torso (Tor). We screened chromosomal deficiencies for suppression of a gain-of-function mutation tor (torGOF), which led to the identification of 26 genomic regions that, when in half dosage, suppressed the defects caused by torGOF. Testing of candidate genes in these regions revealed many genes known to be involved in Tor signaling (such as those encoding the Ras-MAPK cassette, adaptor and structural molecules of RTK signaling, and downstream target genes of Tor), confirming the specificity of this genetic screen. Importantly, this screen also identified components of the TGFβ (Dpp) and JAK/STAT pathways as being required for TorGOF signaling. Specifically, we found that reducing the dosage of thickveins (tkv), Mothers against dpp (Mad), or STAT92E (aka marelle), respectively, suppressed torGOF phenotypes. Furthermore, we demonstrate that in torGOF embryos, dpp is ectopically expressed and thus may contribute to the patterning defects. These results demonstrate an essential requirement of noncanonical signaling pathways for a persistently activated RTK to cause pathological defects in an organism.


Comprehensive identification of transposable element insertions using multiple sequencing technologies

Nature Communications ◽

10.1038/s41467-021-24041-8 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Chong Chu ◽

Rebeca Borges-Monroy ◽

Vinayak V. Viswanadham ◽

Soohyun Lee ◽

Heng Li ◽

...

Keyword(s):

Transposable Element ◽

Structure And Function ◽

Endogenous Retroviruses ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Short Read ◽

Sequencing Technologies ◽

Long Read ◽

And Function

Abstract Transposable elements (TEs) help shape the structure and function of the human genome. When inserted into some locations, TEs may disrupt gene regulation and cause diseases. Here, we present xTea (x-Transposable element analyzer), a tool for identifying TE insertions in whole-genome sequencing data. Whereas existing methods are mostly designed for short-read data, xTea can be applied to both short-read and long-read data. Our analysis shows that xTea outperforms other short read-based methods for both germline and somatic TE insertion discovery. With long-read data, we created a catalogue of polymorphic insertions with full assembly and annotation of insertional sequences for various types of retroelements, including pseudogenes and endogenous retroviruses. Notably, we find that individual genomes have an average of nine groups of full-length L1s in centromeres, suggesting that centromeres and other highly repetitive regions such as telomeres are a significant yet unexplored source of active L1s. xTea is available at https://github.com/parklab/xTea.


Genome-Wide Analysis of Glucocorticoid-Responsive Transcripts in the Hypothalamic Paraventricular Region of Male Rats

Endocrinology ◽

10.1210/en.2018-00535 ◽

2018 ◽

Vol 160 (1) ◽

pp. 38-54 ◽

Cited By ~ 2

Author(s):

Keiichi Itoi ◽

Ikuko Motoike ◽

Ying Liu ◽

Sam Clokie ◽

Yasumasa Iwasaki ◽

...

Keyword(s):

Gene Expression ◽

Target Genes ◽

Receptor Gene ◽

Male Rats ◽

Regulatory Mechanisms ◽

High Dose ◽

Sequencing Analysis ◽

Reverse Transcription Pcr ◽

Genome Wide ◽

A Genome

Abstract Glucocorticoids (GCs) are essential for stress adaptation, acting centrally and in the periphery. Corticotropin-releasing factor (CRF), a major regulator of adrenal GC synthesis, is produced in the paraventricular nucleus of the hypothalamus (PVH), which contains multiple neuroendocrine and preautonomic neurons. GCs may be involved in diverse regulatory mechanisms in the PVH, but the target genes of GCs are largely unexplored except for the CRF gene (Crh), a well-known target for GC negative feedback. Using a genome-wide RNA-sequencing analysis, we identified transcripts that changed in response to either high-dose corticosterone (Cort) exposure for 12 days (12-day high Cort), corticoid deprivation for 7 days (7-day ADX), or acute Cort administration. Among others, canonical GC target genes were upregulated prominently by 12-day high Cort. Crh was upregulated most prominently by 7-day ADX and downregulated most prominently by 12-day high Cort, emphasizing the recognized feedback effects of GC on the hypothalamic-pituitary-adrenal (HPA) axis. Concomitant changes in vasopressin and apelin receptor gene expression are likely to contribute to HPA repression. In keeping with the pleiotropic cellular actions of GCs, 7-day ADX downregulated numerous genes of a broad functional spectrum. The transcriptome response signature differed markedly between acute Cort injection and 12-day high Cort. Remarkably, six immediate early genes were upregulated 1 hour after Cort injection, which was confirmed by quantitative reverse transcription PCR and semiquantitative in situ hybridization. This study may provide a useful database for studying the regulatory mechanisms of GC-dependent gene expression and repression in the PVH.
