DNAModAnnot: a R toolbox for DNA modification filtering and annotation

Bioinformatics ◽ 10.1093/bioinformatics/btab032 ◽ 2021

Author(s): Alexis Hardy ◽ Mélody Matelot ◽ Amandine Touzeau ◽ Christophe Klopp ◽ Céline Lopez-Roques ◽ ...

Keyword(s): Global Analysis ◽ R Package ◽ Supplementary Information ◽ Dna Modification ◽ Paramecium Tetraurelia ◽ Sequencing Data ◽ Genome Wide ◽ A Genome ◽ Dna Modifications ◽ Long Read

Abstract Motivation: Long-read sequencing technologies can be employed to detect and map DNA modifications at nucleotide resolution on a genome-wide scale. However, published software packages neglect the integration of genomic annotation and comprehensive filtering when analyzing patterns of modified bases detected using Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT) data. Here, we present DNAModAnnot, an R package designed for the global analysis of DNA modification patterns using adapted filtering and visualization tools. Results: We tested our package using PacBio sequencing data to analyze patterns of 6-methyladenine (6mA) in the ciliate Paramecium tetraurelia, in which high 6mA amounts were previously reported. We found the genome-wide distribution of 6mA in Paramecium tetraurelia to be similar to that of other ciliates. We also performed 5-methylcytosine (5mC) analysis in human lymphoblastoid cells using ONT data and confirmed previously known patterns of 5mC. DNAModAnnot provides a toolbox for the genome-wide analysis of different DNA modifications using PacBio and ONT long-read sequencing data. Availability: DNAModAnnot is distributed as an R package available via GitHub (https://github.com/AlexisHardy/DNAModAnnot). Supplementary information: Supplementary data are available at Bioinformatics online.

Dense and accurate whole-chromosome haplotyping of individual genomes

10.1101/126136 ◽ 2017 ◽ Cited By ~ 1

Author(s): David Porubsky ◽ Shilpa Garg ◽ Ashley D. Sanders ◽ Jan O. Korbel ◽ Victor Guryev ◽ ...

Keyword(s): Target Genes ◽ Chromosome Length ◽ Single Individual ◽ Sequencing Data ◽ Individual Genome ◽ Sequencing Technologies ◽ Biological Phenomena ◽ Genome Wide ◽ A Genome ◽ Long Read

Abstract: The diploid nature of the genome is neglected in many analyses done today, where a genome is perceived as a set of unphased variants with respect to a reference genome. Many important biological phenomena such as compound heterozygosity and epistatic effects between enhancers and target genes, however, can only be studied when haplotype-resolved genomes are available. This lack of haplotype-level analyses can be explained by a dearth of methods to produce dense and accurate chromosome-length haplotypes at reasonable costs. Here we introduce an integrative phasing strategy that combines global but sparse haplotypes obtained from strand-specific single cell sequencing (Strand-seq) with dense, yet local, haplotype information available through long-read or linked-read sequencing. Our experiments provide comprehensive guidance on favorable combinations of Strand-seq libraries and sequencing coverages to obtain complete and genome-wide haplotypes of a single individual genome (NA12878) at manageable costs. We were able to reliably assign >95% of alleles to their parental haplotypes using as few as 10 Strand-seq libraries in combination with 10-fold coverage PacBio data or, alternatively, 10X Genomics linked-read sequencing data. We conclude that the combination of Strand-seq with different sequencing technologies represents an attractive solution to chart the unique genetic variation of diploid genomes.

Eagle: multi-locus association mapping on a genome-wide scale made routine

Bioinformatics ◽ 10.1093/bioinformatics/btz759 ◽ 2019 ◽ Vol 36 (5) ◽ pp. 1509-1516

Author(s): Andrew W George ◽ Arunas Verbyla ◽ Joshua Bowden

Keyword(s): Association Mapping ◽ Multiple Testing ◽ R Package ◽ Single Locus ◽ Supplementary Information ◽ Locus Method ◽ Genome Wide ◽ A Genome ◽ Wide Scale ◽ Mouse Study

Abstract Motivation: We present Eagle, a new method for multi-locus association mapping. The motivation for developing Eagle was to make multi-locus association mapping ‘easy’ and the method-of-choice. Eagle’s strengths are that it (i) is considerably more powerful than single-locus association mapping, (ii) does not suffer from multiple testing issues, (iii) gives results that are immediately interpretable and (iv) has a computational footprint comparable to single-locus association mapping. Results: By conducting a large simulation study, we show that Eagle finds true and avoids false single-nucleotide polymorphism trait associations better than competing single- and multi-locus methods. We also analyze data from a published mouse study. Eagle found over 50% more validated findings than the state-of-the-art single-locus method. Availability and implementation: Eagle has been implemented as an R package, with a browser-based Graphical User Interface for users less familiar with R. It is freely available via the CRAN website at https://cran.r-project.org. Videos, Quick Start guides, FAQs and Demos are available via the Eagle website http://eagle.r-forge.r-project.org. Supplementary information: Supplementary data are available at Bioinformatics online.

Genome-wide survey of tandem repeats by nanopore sequencing shows that disease-associated repeats are more polymorphic in the general population

BMC Medical Genomics ◽ 10.1186/s12920-020-00853-3 ◽ 2021 ◽ Vol 14 (1)

Author(s): Satomi Mitsuhashi ◽ Martin C. Frith ◽ Naomichi Matsumoto

Keyword(s): General Population ◽ Tandem Repeats ◽ Repeat Unit ◽ Mendelian Disease ◽ Length Variation ◽ Sequencing Data ◽ Genome Wide ◽ A Genome ◽ Long Read ◽ Genome Wide Survey

Abstract Background: Tandem repeats are highly mutable and contribute to the development of human disease by a variety of mechanisms. It is difficult to predict which tandem repeats may cause a disease. One hypothesis is that changeable tandem repeats are the source of genetic diseases, because disease-causing repeats are polymorphic in healthy individuals. However, it is not clear whether disease-causing repeats are more polymorphic than other repeats. Methods: We performed a genome-wide survey of the millions of human tandem repeats using publicly available long read genome sequencing data from 21 humans. We measured tandem repeat copy number changes using tandem-genotypes. Length variation of known disease-associated repeats was compared to other repeat loci. Results: We found that known Mendelian disease-causing or disease-associated repeats, especially CAG and 5′UTR GGC repeats, are relatively long and polymorphic in the general population. We also show that repeat lengths of two disease-causing tandem repeats, in ATXN3 and GLS, are correlated with nearby GWAS SNP genotypes. Conclusions: We provide a catalog of polymorphic tandem repeats across a variety of repeat unit lengths and sequences, from long read sequencing data. This method, especially if used in genome-wide association studies, may indicate possible new candidates of pathogenic or biologically important tandem repeats in human genomes.

Genome-wide Survey of Tandem Repeats by Nanopore Sequencing Shows that Disease-associated Repeats are More Polymorphic in the General Population

10.21203/rs.3.rs-79348/v1 ◽ 2020

Author(s): Satomi Mitsuhashi ◽ Martin C Frith ◽ Naomichi Matsumoto

Keyword(s): General Population ◽ Tandem Repeats ◽ Repeat Unit ◽ Mendelian Disease ◽ Length Variation ◽ Sequencing Data ◽ Genome Wide ◽ A Genome ◽ Long Read ◽ Genome Wide Survey

Abstract Background: Tandem repeats are highly mutable and contribute to the development of human disease by a variety of mechanisms. It is difficult to predict which tandem repeats may cause a disease. One hypothesis is that changeable tandem repeats are the source of genetic diseases, because disease-causing repeats are polymorphic in healthy individuals. However, it is not clear whether disease-causing repeats are more polymorphic than other repeats. Methods: We performed a genome-wide survey of the millions of human tandem repeats using publicly available long read genome sequencing data from 21 humans. We measured tandem repeat copy number changes using tandem-genotypes. Length variation of known disease-associated repeats was compared to other repeat loci. Results: We found that known Mendelian disease-causing or disease-associated repeats, especially CAG and 5'UTR GGC repeats, are relatively long and polymorphic in the general population. We also show that repeat lengths of two disease-causing tandem repeats, in ATXN3 and GLS, are correlated with nearby GWAS SNP genotypes. Conclusions: We provide a catalog of polymorphic tandem repeats across a variety of repeat unit lengths and sequences, from long read sequencing data. This method, especially if used in genome-wide association studies (GWAS), may indicate possible new candidates of pathogenic or biologically important tandem repeats in human genomes.

Genome-wide survey of tandem repeats by nanopore sequencing shows that disease-associated repeats are more polymorphic in the general population

10.1101/2019.12.19.883389 ◽ 2019

Author(s): Satomi Mitsuhashi ◽ Martin C Frith ◽ Naomichi Matsumoto

Keyword(s): General Population ◽ Tandem Repeats ◽ Mendelian Disease ◽ Nanopore Sequencing ◽ Sequencing Data ◽ Human Genomes ◽ Genome Wide ◽ A Genome ◽ Long Read ◽ Genome Wide Survey

Abstract: Tandem repeats are highly mutable and contribute to the development of human disease by a variety of mechanisms. However, it is difficult to predict which tandem repeats may cause a disease. We performed a genome-wide survey of the millions of human tandem repeats using long read genome sequencing data from 16 humans. We found that known Mendelian disease-causing or disease-associated repeats, especially coding CAG and 5’UTR GGC repeats, are relatively long and polymorphic in the general population. This method, especially if used in GWAS, may indicate possible new candidates of pathogenic or biologically important tandem repeats in human genomes.

Quinoa genome assembly employing genomic variation for guided scaffolding

Theoretical and Applied Genetics ◽ 10.1007/s00122-021-03915-x ◽ 2021

Author(s): Alexandrina Bodrug-Schepers ◽ Nancy Stralis-Pavese ◽ Hermann Buerstmayr ◽ Juliane C. Dohm ◽ Heinz Himmelbauer

Keyword(s): Genome Assembly ◽ Chenopodium Quinoa ◽ Genomic Variation ◽ Valuable Resource ◽ Sequencing Data ◽ Genome Wide ◽ A Genome ◽ Long Read ◽ Genome Assemblies ◽ Haplotype Information

Key message: We propose to use the natural variation between individuals of a population for genome assembly scaffolding. In today’s genome projects, multiple accessions get sequenced, leading to variant catalogs. Using such information to improve genome assemblies is attractive both in cost and scientifically, because the value of an assembly increases with its contiguity. We conclude that haplotype information is a valuable resource to group and order contigs toward the generation of pseudomolecules. Abstract: Quinoa (Chenopodium quinoa) has been under cultivation in Latin America for more than 7500 years. Recently, quinoa has gained increasing attention due to its stress resistance and its nutritional value. We generated a novel quinoa genome assembly for the Bolivian accession CHEN125 using PacBio long-read sequencing data (assembly size 1.32 Gbp, initial N50 size 608 kbp). Next, we re-sequenced 50 quinoa accessions from Peru and Bolivia. This set of accessions differed at 4.4 million single-nucleotide variant (SNV) positions compared to CHEN125 (1.4 million SNV positions on average per accession). We show how to exploit variation in accessions that are distantly related to establish a genome-wide ordered set of contigs for guided scaffolding of a reference assembly. The method is based on detecting shared haplotypes and their expected continuity throughout the genome (i.e., the effect of linkage disequilibrium), as an extension of what is expected in mapping populations where only a few haplotypes are present. We test the approach using Arabidopsis thaliana data from different populations. After applying the method to our CHEN125 quinoa assembly, we validated the results with mate-pairs, genetic markers, and another quinoa assembly originating from a Chilean cultivar. We show consistency between these information sources and the haplotype-based relations as determined by us and obtain an improved assembly with an N50 size of 1079 kbp and ordered contig groups of up to 39.7 Mbp. We conclude that haplotype information in distantly related individuals of the same species is a valuable resource to group and order contigs according to their adjacency in the genome toward the generation of pseudomolecules.
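
The abstract above reports contiguity as N50 (608 kbp before and 1079 kbp after guided scaffolding). As a reminder of what that statistic measures, here is a minimal sketch of the conventional N50 calculation. The function name `n50` and the contig lengths are illustrative assumptions and are not taken from the CHEN125 assembly or the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in kbp (made up for illustration).
contigs_kbp = [900, 700, 400, 300, 200]
print(n50(contigs_kbp))

# Joining the two longest contigs into one scaffold raises the N50,
# which is why grouping and ordering contigs improves this statistic.
scaffolded_kbp = [900 + 700, 400, 300, 200]
print(n50(scaffolded_kbp))
```

Because N50 depends only on the length distribution, it says nothing about whether joins are correct, which is presumably why the authors validate against mate-pairs, genetic markers and a second assembly.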

hypeR: An R Package for Geneset Enrichment Workflows

10.1101/656637 ◽ 2019 ◽ Cited By ~ 1

Author(s): Anthony Federico ◽ Stefano Monti

Keyword(s): High Throughput Sequencing ◽ R Package ◽ Supplementary Information ◽ Sequencing Data ◽ Wide Audience ◽ Popular Method ◽ Link Type ◽ High Throughput Sequencing Data ◽ One Stop ◽ Recent Version

Abstract Summary: Geneset enrichment is a popular method for annotating high-throughput sequencing data. Existing tools fall short in providing the flexibility to tackle the varied challenges researchers face in such analyses, particularly when analyzing many signatures across multiple experiments. We present a comprehensive R package for geneset enrichment workflows that offers multiple enrichment, visualization, and sharing methods in addition to novel features such as hierarchical geneset analysis and built-in markdown reporting. hypeR is a one-stop solution to performing geneset enrichment for a wide audience and range of use cases. Availability and implementation: The most recent version of the package is available at https://github.com/montilab/hypeR. Supplementary information: Comprehensive documentation and tutorials are available at https://montilab.github.io/hypeR-docs.
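
The abstract does not spell out the statistics behind geneset enrichment. One widely used approach is a hypergeometric over-representation test, sketched below in Python using only the standard library. This is an illustration of the general technique under that assumption, not hypeR's API: the function name `hypergeom_enrichment_p`, the gene identifiers and the background size are all hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(signature, geneset, background_size):
    """Over-representation test for one signature against one geneset.

    Returns (overlap, p_value), where p_value is P(X >= overlap) for X
    hypergeometrically distributed: drawing len(signature) genes without
    replacement from a background of `background_size` genes, of which
    len(geneset) belong to the geneset. Assumes both inputs are subsets
    of the background.
    """
    signature, geneset = set(signature), set(geneset)
    k = len(signature & geneset)   # observed overlap
    n = len(signature)             # number of genes drawn
    K = len(geneset)               # "successes" in the background
    N = background_size
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return k, tail / comb(N, n)


# Hypothetical example: 20 background genes, a 5-gene set, a 4-gene signature.
geneset = ["G1", "G2", "G3", "G4", "G5"]
signature = ["G1", "G2", "G3", "G17"]
overlap, p_value = hypergeom_enrichment_p(signature, geneset, background_size=20)
print(overlap, round(p_value, 4))
```

When many genesets are tested against many signatures, as the abstract describes, the resulting p-values would normally be corrected for multiple testing before interpretation.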

bGWAS: an R package to perform Bayesian genome wide association studies

Bioinformatics ◽ 10.1093/bioinformatics/btaa549 ◽ 2020 ◽ Vol 36 (15) ◽ pp. 4374-4376

Author(s): Ninon Mounier ◽ Zoltán Kutalik

Keyword(s): Mendelian Randomization ◽ Causal Effect ◽ Association Studies ◽ R Package ◽ Genome Wide Association ◽ Supplementary Information ◽ Genome Wide Association Studies ◽ Biological Mechanisms ◽ Genome Wide ◽ Related Risk

Abstract Summary: Increasing sample size is not the only strategy to improve discovery in Genome Wide Association Studies (GWASs), and we propose here an approach that leverages published studies of related traits to improve inference. Our Bayesian GWAS method derives informative prior effects by leveraging GWASs of related risk factors and their causal effect estimates on the focal trait using multivariable Mendelian randomization. These prior effects are combined with the observed effects to yield Bayes Factors, posterior and direct effects. The approach not only increases power, but also has the potential to dissect direct and indirect biological mechanisms. Availability and implementation: The bGWAS package is freely available under a GPL-2 License, and can be accessed, alongside user guides and tutorials, from https://github.com/n-mounier/bGWAS. Supplementary information: Supplementary data are available at Bioinformatics online.

Detection and application of genome-wide variations in peach for association and genetic relationship analysis

BMC Genetics ◽ 10.1186/s12863-019-0799-8 ◽ 2019 ◽ Vol 20 (1) ◽ Cited By ~ 2

Author(s): Liping Guan ◽ Ke Cao ◽ Yong Li ◽ Jian Guo ◽ Qiang Xu ◽ ...

Keyword(s): Genetic Relationship ◽ Dna Markers ◽ High Throughput Sequencing ◽ Prunus Persica ◽ Genetic Research ◽ Diploid Species ◽ Sequencing Data ◽ Relationship Analysis ◽ Genome Wide ◽ A Genome

Abstract Background: Peach (Prunus persica L.) is a diploid species and model plant of the Rosaceae family. In the past decade, significant progress has been made in peach genetic research via DNA markers, but the number of these markers remains limited. Results: In this study, we performed genome-wide DNA marker detection based on sequencing data of six distantly related peach accessions. A total of 650,693 to 1,053,547 single nucleotide polymorphisms (SNPs), 114,227 to 178,968 small insertions/deletions (InDels), 8,386 to 12,298 structural variants (SVs), 2,111 to 2,581 copy number variants (CNVs) and 229,357 to 346,940 simple sequence repeats (SSRs) were detected and annotated. To demonstrate the application of DNA markers, 944 SNPs were filtered for an association study of fruit ripening time and 15 highly polymorphic SSRs were selected to analyze the genetic relationship among 221 accessions. Conclusions: The results showed that the use of high-throughput sequencing to develop DNA markers is fast and effective. Comprehensive identification of DNA markers, including SVs and SSRs, would be of benefit to genetic diversity evaluation, genetic mapping, and molecular breeding of peach.

Patterns of genome-wide allele-specific expression in hybrid rice and the implications on the genetic basis of heterosis

Proceedings of the National Academy of Sciences ◽ 10.1073/pnas.1820513116 ◽ 2019 ◽ Vol 116 (12) ◽ pp. 5653-5658 ◽ Cited By ~ 15

Author(s): Lin Shao ◽ Feng Xing ◽ Conghao Xu ◽ Qinghua Zhang ◽ Jian Che ◽ ...

Keyword(s): Genetic Basis ◽ Molecular Mechanisms ◽ Sequencing Data ◽ Specific Expression ◽ Allele Specific Expression ◽ Parental Allele ◽ Genetic Components ◽ Genome Wide ◽ A Genome ◽ Allele Specific

Utilization of heterosis has greatly increased the productivity of many crops worldwide. Although tremendous progress has been made in characterizing the genetic basis of heterosis using genomic technologies, molecular mechanisms underlying the genetic components are much less understood. Allele-specific expression (ASE), or imbalance between the expression levels of two parental alleles in the hybrid, has been suggested as a mechanism of heterosis. Here, we performed a genome-wide analysis of ASE by comparing the read ratios of the parental alleles in RNA-sequencing data of an elite rice hybrid and its parents using three tissues from plants grown under four conditions. The analysis identified a total of 3,270 genes showing ASE (ASEGs) in various ways, which can be classified into two patterns: consistent ASEGs such that the ASE was biased toward one parental allele in all tissues/conditions, and inconsistent ASEGs such that ASE was found in some but not all tissues/conditions, including direction-shifting ASEGs in which the ASE was biased toward one parental allele in some tissues/conditions while toward the other parental allele in other tissues/conditions. The results suggested that these patterns may have distinct implications in the genetic basis of heterosis: The consistent ASEGs may cause partial to full dominance effects on the traits that they regulate, and direction-shifting ASEGs may cause overdominance. We also showed that ASEGs were significantly enriched in genomic regions that were differentially selected during rice breeding. These ASEGs provide an index of the genes for future pursuit of the genetic and molecular mechanism of heterosis.
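
The abstract describes detecting ASE by comparing read ratios of the two parental alleles in the hybrid. One simple way to formalize such a comparison is an exact two-sided binomial test against an expected 50:50 ratio, sketched below. The abstract does not state which test the authors used, so the choice of test, the function name `ase_binomial_test` and the read counts are all assumptions made for illustration.

```python
from math import comb


def ase_binomial_test(reads_a, reads_b, expected=0.5):
    """Exact two-sided binomial test for allelic imbalance at one gene.

    reads_a and reads_b are read counts assigned to the two parental
    alleles in the hybrid. Returns (ratio_a, p_value); the p-value sums
    the probabilities of all outcomes no more likely than the observed
    one. Intended for modest read counts, since probabilities are
    computed directly in floating point.
    """
    n = reads_a + reads_b
    if n == 0:
        raise ValueError("no allele-informative reads")
    probs = [comb(n, i) * expected**i * (1 - expected)**(n - i)
             for i in range(n + 1)]
    observed = probs[reads_a]
    # Small tolerance so that ties are not lost to rounding error.
    p_value = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return reads_a / n, min(1.0, p_value)


# Hypothetical counts for one gene in one tissue/condition.
ratio, p = ase_binomial_test(8, 2)
print(ratio, p)
```

Running such a test per gene in each tissue and condition, then comparing the direction of the bias across them, would be one way to separate the consistent, inconsistent and direction-shifting patterns the abstract describes. In practice the per-gene p-values would also need multiple-testing correction.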