mutyper: assigning and summarizing mutation types for analyzing germline mutation spectra

Mapping Intimacies ◽

10.1101/2020.07.01.183392 ◽

2020 ◽

Author(s):

William S. DeWitt

Keyword(s):

Germline Mutation ◽

Population Genomics ◽

Mutation Spectrum ◽

Sequencing Data ◽

Genomic Context ◽

Frequency Spectra ◽

Link Type ◽

Snp Data ◽

Population Genetic Inference ◽

Mutation Spectra

Abstract Summary: Characterization of germline mutation spectrum variation from population genomics data has shed light on the biological complexity of the mutation process and its evolution within and between species. This increasingly common analysis augments available population SNP data with estimates of local ancestral genomic context to assign mutation types and aggregate summary statistics thereof. There is a need for standardized computational tools to extract mutation spectrum information from sequencing data. Here I describe mutyper, a command-line utility and Python package that uses an ancestral genome estimate to assign mutation types to SNP data, compute mutation spectra for individuals, and compute sample frequency spectra resolved by mutation type for population genetic inference. Availability and implementation: mutyper can be installed using the pip package manager and is compatible with Python 3.6+. Documentation is provided at https://harrispopgen.github.io/mutyper; source code is available at https://github.com/harrispopgen/mutyper.
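The idea of assigning a mutation type from ancestral context can be illustrated with a minimal sketch. This is not mutyper's API: the function names, the 0-based position argument, and the convention of collapsing reverse complements so that the ancestral central base is A or C are all assumptions made for illustration.

```python
from collections import Counter

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def mutation_type(ancestral_seq, pos, ref, alt, k=3):
    """Return a k-mer mutation type such as 'ACG>ATG', or None.

    ancestral_seq: estimated ancestral sequence (hypothetical input).
    pos: 0-based position of the SNP in ancestral_seq.
    ref, alt: the two alleles observed at the SNP.
    """
    flank = k // 2
    if pos - flank < 0 or pos + flank >= len(ancestral_seq):
        return None  # not enough flanking context
    context = ancestral_seq[pos - flank:pos + flank + 1].upper()
    if any(base not in "ACGT" for base in context):
        return None  # ambiguous ancestral context
    ancestral = context[flank]
    # Polarize: the derived allele is whichever allele is not ancestral.
    if ancestral == ref:
        derived = alt
    elif ancestral == alt:
        derived = ref
    else:
        return None  # ancestral state matches neither allele
    mutated = context[:flank] + derived + context[flank + 1:]
    # Collapse strand-equivalent types (assumed convention).
    if ancestral in "GT":
        context, mutated = revcomp(context), revcomp(mutated)
    return f"{context}>{mutated}"


# Aggregate a mutation spectrum over a few toy SNPs.
ancestor = "TTACGGA"
snps = [(3, "C", "T"), (3, "T", "C"), (4, "G", "A")]
spectrum = Counter(mutation_type(ancestor, p, r, a) for p, r, a in snps)
```

Counting the returned types per individual, or per sample-frequency class, would give the kinds of summaries the abstract describes.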


nPhase: An accurate and contiguous phasing method for polyploids

10.1101/2020.07.24.219105 ◽

2020 ◽

Cited By ~ 1

Author(s):

Omar Abou Saada ◽

Andreas Tsouris ◽

Anne Friedrich ◽

Joseph Schacherer

Keyword(s):

Saccharomyces Cerevisiae ◽

Genome Sequencing ◽

Population Genomics ◽

Genomic Data ◽

Reference Alignment ◽

Sequencing Data ◽

Model Species ◽

Short Reads ◽

Link Type ◽

Long Reads

Abstract While genome sequencing and assembly are now routine, we still do not have a full and precise picture of polyploid genomes. Phasing these genomes, i.e., deducing haplotypes from genomic data, remains a challenge. Despite numerous attempts, no existing polyploid phasing method provides accurate and contiguous haplotype predictions. To address this need, we developed nPhase, a ploidy-agnostic pipeline and algorithm that leverage the accuracy of short reads and the length of long reads to solve reference alignment-based phasing for samples of unspecified ploidy (https://github.com/nPhasePipeline/nPhase). nPhase was validated on virtually constructed polyploid genomes of the model species Saccharomyces cerevisiae, generated by combining sequencing data of homozygous isolates. nPhase obtained on average >95% accuracy and a contiguous 1.25 haplotigs per haplotype to cover >90% of each chromosome (heterozygosity rate ≥0.5%). This new phasing method opens the door to exploring polyploid genomes through applications such as population genomics and hybrid studies.


nPhase: an accurate and contiguous phasing method for polyploids

Genome Biology ◽

10.1186/s13059-021-02342-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Omar Abou Saada ◽

Andreas Tsouris ◽

Chris Eberlein ◽

Anne Friedrich ◽

Joseph Schacherer

Keyword(s):

Genome Sequencing ◽

Population Genomics ◽

Short Reads ◽

Link Type ◽

Long Reads

Abstract While genome sequencing and assembly are now routine, we do not have a full, precise picture of polyploid genomes. No existing polyploid phasing method provides accurate and contiguous haplotype predictions. We developed nPhase, a ploidy-agnostic tool that leverages long reads and accurate short reads to solve alignment-based phasing for samples of unspecified ploidy (https://github.com/OmarOakheart/nPhase). nPhase is validated by tests on simulated and real polyploids. nPhase obtains on average over 95% accuracy and a contiguous 1.25 haplotigs per haplotype to cover more than 90% of each chromosome (heterozygosity rate ≥ 0.5%). nPhase allows population genomics and hybrid studies of polyploids.


Population Genomics of American Mink Using Whole Genome Sequencing Data

Genes ◽

10.3390/genes12020258 ◽

2021 ◽

Vol 12 (2) ◽

pp. 258

Author(s):

Karim Karimi ◽

Duy Ngoc Do ◽

Mehdi Sargolzaei ◽

Younes Miar

Keyword(s):

Population Genomics ◽

Association Studies ◽

American Mink ◽

Population History ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Effective Population ◽

Cross Validation Error

Characterizing the genetic structure and population history can facilitate the development of genomic breeding strategies for the American mink. In this study, we used the whole genome sequences of 100 mink from the Canadian Centre for Fur Animal Research (CCFAR) at the Dalhousie Faculty of Agriculture (Truro, NS, Canada) and Millbank Fur Farm (Rockwood, ON, Canada) to investigate their population structure, genetic diversity and linkage disequilibrium (LD) patterns. Analysis of molecular variance (AMOVA) indicated that the variation among color-types was significant (p < 0.001) and accounted for 18% of the total variation. The admixture analysis revealed that assuming three ancestral populations (K = 3) provided the lowest cross-validation error (0.49). The effective population size (Ne) five generations ago was estimated to be 99 and 50 for CCFAR and Millbank Fur Farm, respectively. The LD patterns revealed that the average r² fell below 0.2 at genomic distances of >20 kb in CCFAR and >100 kb in Millbank Fur Farm, suggesting that densities of 120,000 and 24,000 single nucleotide polymorphisms (SNPs), respectively, would provide adequate accuracy of genomic evaluation in these populations. These results indicated that accounting for admixture is critical for designing the SNP panels for genotype-phenotype association studies of American mink.
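The two marker densities quoted above are consistent with placing roughly one SNP per LD block. A minimal sketch of that arithmetic follows; the genome length of about 2.4 Gb is back-calculated here from the quoted figures rather than stated in the abstract, and the helper name is hypothetical.

```python
def snps_needed(genome_length_bp, ld_block_bp):
    """Approximate marker count for one SNP per LD block (illustrative only)."""
    return genome_length_bp // ld_block_bp


# Assumed genome length, implied by 120,000 SNPs x 20 kb (not from the abstract).
GENOME_LENGTH_BP = 2_400_000_000

ccfar_snps = snps_needed(GENOME_LENGTH_BP, 20_000)      # r² < 0.2 beyond 20 kb
millbank_snps = snps_needed(GENOME_LENGTH_BP, 100_000)  # r² < 0.2 beyond 100 kb
```

Under this assumption the slower LD decay at Millbank Fur Farm translates directly into a fivefold smaller panel.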


Global sequence characterization of rice centromeric satellite based on oligomer frequency analysis in large-scale sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btq343 ◽

2010 ◽

Vol 26 (17) ◽

pp. 2101-2108 ◽

Cited By ~ 27

Author(s):

Jiří Macas ◽

Pavel Neumann ◽

Petr Novák ◽

Jiming Jiang

Keyword(s):

Large Scale ◽

Rice Genome ◽

Supplementary Information ◽

Sequencing Data ◽

Satellite Repeat ◽

Frequency Spectra ◽

Consensus Sequences ◽

Chip Sequencing ◽

Conserved Sequence ◽

Centromeric Satellite

Abstract Motivation: Satellite DNA makes up a significant portion of many eukaryotic genomes, yet it is relatively poorly characterized even in extensively sequenced species. This is, in part, due to methodological limitations of traditional methods of satellite repeat analysis, which are based on multiple alignments of monomer sequences. Therefore, we employed an alternative, alignment-free approach utilizing k-mer frequency statistics, which is in principle more suitable for analyzing large sets of satellite repeat data, including sequence reads from next-generation sequencing technologies. Results: k-mer frequency spectra were determined for two sets of rice centromeric satellite CentO sequences, including 454 reads from ChIP-sequencing of CENH3-bound DNA (7.6 Mb) and the whole genome Sanger sequencing reads (5.8 Mb). k-mer frequencies were used to identify the most conserved sequence regions and to reconstruct consensus sequences of complete monomers. Reconstructed consensus sequences as well as the assessment of overall divergence of k-mer spectra revealed high similarity of the two datasets, suggesting that CentO sequences associated with functional centromeres (CENH3-bound) do not significantly differ from the total population of CentO, which includes both centromeric and pericentromeric repeat arrays. On the other hand, considerable differences were revealed when these methods were used for comparison of CentO populations between individual chromosomes of the rice genome assembly, demonstrating preferential sequence homogenization of the clusters within the same chromosome. k-mer frequencies were also successfully used to identify and characterize smRNAs derived from CentO repeats. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
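An alignment-free k-mer frequency spectrum of the kind described above can be sketched in a few lines. This is a generic illustration, not the authors' implementation; the function name, the toy reads, and the choice to skip k-mers containing ambiguous bases are assumptions.

```python
from collections import Counter


def kmer_frequencies(reads, k):
    """Relative frequency of each k-mer across a collection of reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers with ambiguous bases
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Toy reads standing in for satellite repeat sequences.
freqs = kmer_frequencies(["ACGTAC", "CGTA"], k=3)
most_common = max(freqs, key=lambda kmer: (freqs[kmer], kmer))
```

Comparing such spectra between two read sets, for example by correlating the frequencies of shared k-mers, is one simple way to assess their overall divergence without aligning monomers.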


Complete mitochondrial genome sequence of Labriocimbex sinicus, a new genus and new species of Cimbicidae (Hymenoptera) from China

PeerJ ◽

10.7717/peerj.7853 ◽

2019 ◽

Vol 7 ◽

pp. e7853 ◽

Cited By ~ 1

Author(s):

Yuchen Yan ◽

Gengyun Niu ◽

Yaoyao Zhang ◽

Qianying Ren ◽

Shiyu Du ◽

...

Keyword(s):

Mitochondrial Genome ◽

New Genus ◽

High Throughput Sequencing ◽

Phylogenetic Analyses ◽

Complete Mitochondrial Genome ◽

Sister Group ◽

Morphological Characters ◽

Trna Genes ◽

Sequencing Data ◽

Link Type

Labriocimbex sinicus Yan & Wei gen. et sp. nov. of Cimbicidae is described. The new genus is similar to Praia Andre and Trichiosoma Leach. A key to extant Holarctic genera of Cimbicinae is provided. To identify the phylogenetic placement of Cimbicidae, the mitochondrial genome of L. sinicus was annotated and characterized using high-throughput sequencing data. The complete mitochondrial genome of L. sinicus was obtained with a length of 15,405 bp (GenBank: MH136623; SRA: SRR8270383) and a typical set of 37 genes (22 tRNAs, 13 PCGs, and two rRNAs). The results demonstrated that all PCGs were initiated by ATN codons and ended with TAA or T stop codons. The study reveals that all tRNA genes have a typical clover-leaf secondary structure, except for trnS1. Remarkably, the secondary structures of the rrnS and rrnL of L. sinicus differed considerably from those of Corynis lateralis. Phylogenetic analyses verified the monophyly and positions of the three Cimbicidae species within the superfamily Tenthredinoidea and demonstrated a relationship of (Tenthredinidae + Cimbicidae) + (Argidae + Pergidae) with strong nodal support. Furthermore, we found that the generic relationships of Cimbicidae revealed by the phylogenetic analyses based on COI genes agree quite closely with the systematic arrangement of the genera based on morphological characters. Phylogenetic trees based on two methods show that L. sinicus is the sister group of Praia with high support values. We suggest that Labriocimbex belongs to the tribe Trichiosomini of Cimbicinae based on adult morphology and molecular data. We also suggest promoting the subgenus Asitrichiosoma to a valid genus.


MHC*IMP – Imputation of Alleles for Genes in the Major Histocompatibility Complex

10.1101/2020.01.24.919191 ◽

2020 ◽

Author(s):

David McG. Squire ◽

Allan Motyer ◽

Richard Ahn ◽

Joanne Nititham ◽

Zhi-Ming Huang ◽

...

Keyword(s):

Major Histocompatibility Complex ◽

Prediction Accuracy ◽

Cross Validation ◽

Whole Genome Sequencing Data ◽

Major Histocompatibility ◽

Sequencing Data ◽

Imputation Model ◽

Human Major Histocompatibility Complex ◽

Histocompatibility Complex ◽

Snp Data

Abstract We report the development of MHC*IMP, a method for imputing non-classical HLA and other genes in the human Major Histocompatibility Complex (MHC). We created a reference panel for 25 genes in the MHC using allele calls from Whole Genome Sequencing data, combined with SNP data for the same individuals. We used this to construct an allele imputation model, MHC*IMP, for each gene. Cross-validation showed that MHC*IMP performs very well, with allele prediction accuracy 93% or greater for all but two of the genes, and greater than 95% for all but four.


idCOV: a pipeline for quick clade identification of SARS-CoV-2 isolates

10.1101/2020.10.08.330456 ◽

2020 ◽

Author(s):

Xun Zhu ◽

Ti-Cheng Chang ◽

Richard Webby ◽

Gang Wu

Keyword(s):

Personal Computer ◽

Source Code ◽

Command Line ◽

Sequencing Data ◽

Link Type ◽

Public Dataset ◽

Virus Isolates

Abstract idCOV is a phylogenetic pipeline for quickly identifying the clades of SARS-CoV-2 virus isolates from raw sequencing data based on a selected clade-defining marker list. Using a public dataset, we show that idCOV, working directly from user-uploaded FastQ files, makes calls equivalent to those annotated by Nextstrain.org on all three common clade systems. Web and equivalent command-line interfaces are available. It can be deployed in any Linux environment, including personal computers, HPC systems and the cloud. The source code is available at https://github.com/xz-stjude/idcov. Installation documentation can be found at https://github.com/xz-stjude/idcov/blob/master/README.md.


hypeR: An R Package for Geneset Enrichment Workflows

10.1101/656637 ◽

2019 ◽

Cited By ~ 1

Author(s):

Anthony Federico ◽

Stefano Monti

Keyword(s):

High Throughput Sequencing ◽

R Package ◽

Supplementary Information ◽

Sequencing Data ◽

Wide Audience ◽

Popular Method ◽

Link Type ◽

High Throughput Sequencing Data ◽

One Stop ◽

Recent Version

Abstract Summary: Geneset enrichment is a popular method for annotating high-throughput sequencing data. Existing tools fall short in providing the flexibility to tackle the varied challenges researchers face in such analyses, particularly when analyzing many signatures across multiple experiments. We present a comprehensive R package for geneset enrichment workflows that offers multiple enrichment, visualization, and sharing methods in addition to novel features such as hierarchical geneset analysis and built-in markdown reporting. hypeR is a one-stop solution for performing geneset enrichment for a wide audience and range of use cases. Availability and implementation: The most recent version of the package is available at https://github.com/montilab/hypeR. Supplementary information: Comprehensive documentation and tutorials are available at https://montilab.github.io/hypeR-docs.


Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples

10.1101/097881 ◽

2017 ◽

Cited By ~ 2

Author(s):

Christopher Wilks ◽

Phani Gaddipati ◽

Abhinav Nellore ◽

Ben Langmead

Keyword(s):

Tissue Specificity ◽

Rna Seq ◽

Sequencing Data ◽

Transcription Start ◽

Link Type ◽

Alternative Transcription ◽

Web App ◽

Inverted Indexing ◽

Splice Junctions ◽

Splicing Patterns

Abstract As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. Snaptron is a search engine for summarized RNA sequencing data with a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over 146 million exon-exon splice junctions from over 70,000 human RNA-seq samples. Queries can be tailored by constraining which junctions and samples to consider. Snaptron can also rank and score junctions according to tissue specificity or other criteria. Further, Snaptron can rank and score samples according to the relative frequency of different splicing patterns. We outline biological questions that can be explored with Snaptron queries, including a study of novel exons in annotated genes, of exonization of repetitive element loci, and of a recently discovered alternative transcription start site for the ALK gene. Web app and documentation are at http://snaptron.cs.jhu.edu. Source code is at https://github.com/ChristopherWilks/snaptron under the MIT license.


Ancient introgression between distantly related white oaks (Quercus sect Quercus) shows evidence of climate-associated asymmetric gene exchange

Journal of Heredity ◽

10.1093/jhered/esab053 ◽

2021 ◽

Author(s):

Scott T O’Donnell ◽

Sorel T Fitz-Gibbon ◽

Victoria L Sork

Keyword(s):

Gene Flow ◽

Genotyping By Sequencing ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Single Nucleotide ◽

Scrub Oak ◽

Population Genetic Inference ◽

Genetic Inference ◽

California Floristic Province ◽

Python Package

Abstract Ancient introgression can be an important source of genetic variation that shapes the evolution and diversification of many taxa. Here, we estimate the timing, direction and extent of gene flow between two distantly related oak species in the same section (Quercus sect. Quercus). We estimated these demographic events using genotyping by sequencing data (GBS), which generated 25,702 single nucleotide polymorphisms (SNPs) for 24 individuals of California scrub oak (Quercus berberidifolia) and 23 individuals of Engelmann oak (Q. engelmannii). We tested several scenarios involving gene flow between these species using the diffusion approximation-based population genetic inference framework and model-testing approach of the Python package DaDi. We found that the most likely demographic scenario includes a bottleneck in Q. engelmannii that coincides with asymmetric gene flow from Q. berberidifolia into Q. engelmannii. Given that the timing of this gene flow coincides with the advent of a Mediterranean-type climate in the California Floristic Province, we propose that changing precipitation patterns and seasonality may have favored the introgression of climate-associated genes from the endemic into the non-endemic California oak.
