An assembly-free method of phylogeny reconstruction using short-read sequences from pooled samples without barcodes

Abstract: A current strategy for obtaining haplotype information from several individuals involves short-read sequencing of pooled amplicons, where fragments from each individual are identified by a unique DNA barcode. In this paper, we report a new method to recover the phylogeny of haplotypes from short-read sequences obtained using pooled amplicons from a mixture of individuals, without barcoding. The method, AFPhyloMix, accepts an alignment of the mixture of reads against a reference sequence, obtains the single-nucleotide polymorphism (SNP) patterns along the alignment, and constructs the phylogenetic tree according to the SNP patterns. AFPhyloMix adopts a Bayesian model of inference to estimate the phylogeny of the haplotypes and their relative frequencies, given that the number of haplotypes is known. In our simulations, AFPhyloMix achieved at least 80% accuracy at recovering the phylogenies and frequencies of the constituent haplotypes, for mixtures with up to 15 haplotypes. AFPhyloMix also worked well on a real data set of kangaroo mitochondrial DNA sequences.

PLoS Computational Biology, DOI 10.1371/journal.pcbi.1008949, 2021, Vol 17 (9), pp. e1008949

Author(s): Thomas K. F. Wong, Teng Li, Louis Ranjard, Steven H. Wu, Jeet Sukumaran, ...

Keyword(s): DNA Sequences, DNA Barcode, Real Data, Reference Sequence, Nucleotide Polymorphisms, Data Set, Single Nucleotide, Short Read, Relative Abundances, Pooled Samples

Detecting Inversions with PCA in the Presence of Population Structure

DOI 10.1101/736900, 2019

Author(s): Ronald J. Nowling, Krystal R. Manke, Scott J. Emrich

Keyword(s): Simulated Data, Principal Component, Real Data, Malaria Vectors, Anopheles coluzzii, Nucleotide Polymorphisms, Data Set, Single Nucleotide, Closely Related Species, Proper Analysis

Abstract: Chromosomal inversions are associated with reproductive isolation and adaptation in insects such as Drosophila melanogaster and the malaria vectors Anopheles gambiae and Anopheles coluzzii. While methods based on read alignment have been useful in humans for detecting inversions, these methods are less successful in insects due to long repeated sequences at the breakpoints. Alternatively, inversions can be detected using principal component analysis (PCA) of single nucleotide polymorphisms (SNPs). We apply PCA-based inversion detection to a simulated data set and to real data from multiple insect species. The data vary in complexity from a single inversion in samples drawn from a single population to multiple overlapping inversions occurring in closely related species, samples of which were generated from multiple geographic locations. We show empirically that proper analysis of these data can be challenging when multiple inversions or populations are present, and that our alternative framework is more robust in these more difficult scenarios.

Heteroalleles in Common Wheat: Multiple Differences between Allelic Variants of the Gli-B1 Locus

International Journal of Molecular Sciences, DOI 10.3390/ijms22041832, 2021, Vol 22 (4), pp. 1832

Author(s): Eugene Metakovsky, Laura Pascual, Patrizia Vaccino, Viktor Melnik, Marta Rodriguez-Quijano, ...

Keyword(s): Common Wheat, DNA Sequences, Fragment Length Polymorphism, SNP Markers, Group IV, Nucleotide Polymorphisms, High Genetic Diversity, Single Nucleotide, Allelic Variants, B Genome

The Gli-B1-encoded γ-gliadins and non-coding γ-gliadin DNA sequences for 15 different alleles of common wheat have been compared using seven tests: electrophoretic mobility (EM) and molecular weight (MW) of the encoded major γ-gliadin, restriction fragment length polymorphism (RFLP) patterns (three different markers), known single nucleotide polymorphism (SNP) markers of the Gli-B1 γ-gliadin pseudogene, and sequencing of the pseudogene GAG56B. It was discovered that encoded γ-gliadins with contrasting EM had similar MWs. However, seven allelic variants (designated I to VII) differed from one another in the other six tests: I (alleles Gli-B1i, k, m, o), II (Gli-B1n, q, s), III (Gli-B1b), IV (Gli-B1e, f, g), V (Gli-B1h), VI (Gli-B1d) and VII (Gli-B1a). Allele Gli-B1c (variant VIII) was identical to the alleles from group IV in four of the tests. Some tests might show a fine difference between alleles belonging to the same variant. Our results support the independent origin of at least seven variants at the Gli-B1 locus that might originate from deeply diverged genotypes of the donor(s) of the B genome in hexaploid wheat and therefore might be called “heteroallelic”. The donor’s particularities at the Gli-B1 locus might have been conserved since that time and decisively contribute to the current high genetic diversity of common wheat.

SHI7 Is a Self-Learning Pipeline for Multipurpose Short-Read DNA Quality Control

mSystems, DOI 10.1128/msystems.00202-17, 2018, Vol 3 (3), Cited By ~ 15

Author(s): Gabriel A. Al-Ghalith, Benjamin Hillmann, Kaiwei Ang, Robin Shields-Cutler, Dan Knights

Keyword(s): Quality Control, DNA Sequences, Sequence Data, Background Knowledge, Sequencing Technology, Data Set, Short Read, DNA Quality, Public Data, User Friendly

Abstract: Next-generation sequencing technology is of great importance for many biological disciplines; however, due to technical and biological limitations, the short DNA sequences produced by modern sequencers require numerous quality control (QC) measures to reduce errors, remove technical contaminants, or merge paired-end reads together into longer or higher-quality contigs. Many tools for each step exist, but choosing the appropriate methods and usage parameters can be challenging because the parameterization of each step depends on the particularities of the sequencing technology used, the type of samples being analyzed, and the stochasticity of the instrumentation and sample preparation. Furthermore, end users may not know all of the relevant information about how their data were generated, such as the expected overlap for paired-end sequences or the type of adaptors used, and so cannot make informed choices. This increasing complexity and nuance demand a pipeline that combines existing steps together in a user-friendly way and, when possible, learns reasonable quality parameters from the data automatically. We propose a user-friendly quality control pipeline called SHI7 (canonically pronounced “shizen”), which aims to simplify quality control of short-read data for the end user by predicting the presence and/or type of common sequencing adaptors, what quality scores to trim, whether the data set is shotgun or amplicon sequencing, whether reads are paired end or single end, and whether pairs are stitchable, including the expected amount of pair overlap. We hope that SHI7 will make it easier for all researchers, expert and novice alike, to follow reasonable practices for short-read data quality control.

Importance: Quality control of high-throughput DNA sequencing data is an important but sometimes laborious task requiring background knowledge of the sequencing protocol used (such as adaptor type, sequencing technology, insert size/stitchability, paired-endedness, etc.). Quality control protocols typically require applying this background knowledge to selecting and executing numerous quality control steps with the appropriate parameters, which is especially difficult when working with public data or data from collaborators who use different protocols. We have created a streamlined quality control pipeline intended to substantially simplify the process of DNA quality control from raw machine output files to actionable sequence data. In contrast to other methods, our proposed pipeline is easy to install and use and attempts to learn the necessary parameters from the data automatically with a single command.

Overcoming limitations to environmental DNA studies: A coastal temperate reference sequence database for multiple chloroplast gene regions generated in a single assay.

DOI 10.22541/au.163252330.05592688/v1, 2021

Author(s): Nicole Foster, Kor-jent Dijk, Ed Biffin, Jennifer Young, Vicki Thomson, ...

Keyword(s): DNA Sequences, DNA Barcode, Environmental DNA, Reference Sequence, Reference Database, Chloroplast Gene, Coastal Plants, Reference Databases, Targeted Capture, Comprehensive Reference

A proliferation of environmental DNA (eDNA) research has increased the reliance on reference sequence databases to assign unknown DNA sequences to known taxa. Without comprehensive reference databases, DNA extracted from environmental samples cannot be correctly assigned to taxa, limiting the use of this genetic information to identify organisms in unknown sample mixtures. For animals, standard metabarcoding practices involve amplification of the mitochondrial cytochrome c oxidase subunit 1 (CO1) region, which is universally amplifiable across the majority of animal taxa. This region, however, does not work well as a DNA barcode for plants and fungi, and there is no similar universal single barcode locus that has the same species resolution. Therefore, generating reference sequences has been more difficult, and several loci have been suggested for use in parallel to reach species-level identification. For this reason, we developed a multi-gene targeted capture approach to generate reference DNA sequences for plant taxa across 20 target chloroplast gene regions in a single assay. We successfully compiled a reference database for 93 temperate coastal plants including seagrasses, mangroves, and saltmarshes/samphires. We demonstrate the importance of a comprehensive reference database to prevent species going undetected in eDNA studies. We also investigate how using multiple chloroplast gene regions impacts the ability to discriminate between taxa.

Genome-wide analyses reveal clustering in Cannabis cultivars: the ancient domestication trilogy of a panacea

DOI 10.7287/peerj.preprints.1553v2, 2015, Cited By ~ 3

Author(s): Philippe Henry

Keyword(s): Genetic Variation, Genetic Structure, Open Access, Nucleotide Polymorphisms, Data Set, Single Nucleotide, Genome Wide, Access Data, Diagnostic SNPs, Shed Light

In the present research, I used an open access data set (Medicinal Genomics) consisting of nearly 200,000 genome-wide single nucleotide polymorphisms (SNPs) typed in 28 cannabis accessions to shed light on the plant's underlying genetic structure. Genome-wide loadings were used to sequentially cull less informative markers. The process involved reducing the number of SNPs to 100K, 10K, 1K, and 100, until I identified a set of 42 highly informative SNPs that I present here. The first two principal components account for about three quarters of the genetic variation present in the data set (PC1 = 48.6%, PC2 = 26.3%). This set of diagnostic SNPs is then used to identify clusters into which cannabis accessions segregate. I identified three clear and consistent clusters, reflective of the ancient domestication trilogy of the genus Cannabis.

Database establishment for the secondary fungal DNA barcode translational elongation factor 1α (TEF1α)

Genome, DOI 10.1139/gen-2018-0083, 2019, Vol 62 (3), pp. 160-169, Cited By ~ 8

Author(s): Wieland Meyer, Laszlo Irinyi, Minh Thuy Vi Hoang, Vincent Robert, Dea Garcia-Hermoso, ...

Keyword(s): DNA Sequences, Fungal Infections, DNA Barcode, Elongation Factor, ITS Region, Reference Sequence, Diagnostic Tools, Reference Database, Elongation Factor 1α, Fungal DNA

With new or emerging fungal infections, human and animal fungal pathogens are a growing threat worldwide. Current diagnostic tools are slow, non-specific at the species and subspecies levels, and require specific morphological expertise to accurately identify pathogens from pure cultures. DNA barcodes are easily amplified, universal, short species-specific DNA sequences, which enable rapid identification by comparison with a well-curated reference sequence collection. The primary fungal DNA barcode, the ITS region, was introduced in 2012 and is now routinely used in diagnostic laboratories. However, the ITS region accurately identifies only around 75% of all medically relevant fungal species, which has prompted the development of a secondary barcode to increase the resolution power and suitability of DNA barcoding for fungal disease diagnostics. The translational elongation factor 1α (TEF1α) was selected in 2015 as a secondary fungal DNA barcode, but it has not been implemented in practice, due to the absence of a reference database. Here, we have established a quality-controlled reference database for the secondary barcode that, together with the ISHAM-ITS database, forms the ISHAM barcode database, available online at http://its.mycologylab.org/. We encourage active contributions from the mycology community.

The covariance of heterozygosity as a measure of linkage disequilibrium between blocks of linked and unlinked sites in Hapmap

Genetics Research, DOI 10.1017/s0016672311000255, 2011, Vol 93 (4), pp. 285-290, Cited By ~ 2

Author(s): John A. Sved

Keyword(s): Linkage Disequilibrium, Random Mating, Block Size, Nucleotide Polymorphisms, Single Population, Data Set, Single Nucleotide, Sample Size Calculations, Measure Of Association, Average Measure

Summary: The covariance of heterozygosity serves as a measure of linkage disequilibrium (LD) between genes at two loci, although one that does not have as much information as a parameter such as r². However, it may be extended to blocks of loci (single nucleotide polymorphisms, SNPs) along a chromosome. This has two advantages when searching for significant associations between different chromosomal regions. Calculations for a data set such as Hapmap are complicated by the large number of pairs of loci (SNPs) that need to be considered. For example, a search for significant associations between SNPs on different chromosomes involves around 10¹² calculations for a single population. Furthermore, this may not be an efficient way of detecting associations since r² values calculated from neighbouring pairs will not be independent of each other. The covariance of heterozygosity provides an average measure of association between blocks of any size, and reduces the number of calculations by a factor of b², where b is the block size. Unlike the calculation of r², the covariance of heterozygosity uses just diploid data and is not biased by sample size. Calculations using a block size of 50 have been used to look for associations in the Hapmap data set between regions within and between chromosomes. Within chromosomes, a signal is detected up to around 10 cM. No obviously significant associations have been detected between regions on different chromosomes, although there is a low level of association consistent with departures from random mating.
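As a rough illustration of the block-based idea, the sketch below computes a covariance of heterozygosity between two blocks of SNPs and the b² reduction in the number of comparisons. The 0/1/2 genotype coding, the function names, the toy data, and the choice to sum heterozygous calls within each block before taking the covariance across individuals are all assumptions made for illustration; the summary does not specify the estimator at this level of detail.

```python
from statistics import mean

def block_heterozygosity(genotypes, start, size):
    """Count heterozygous calls (coded 1) per individual within one block.

    genotypes: one list per individual of 0/1/2 genotype codes (assumed coding).
    """
    return [sum(1 for g in row[start:start + size] if g == 1) for row in genotypes]

def covariance(x, y):
    """Sample covariance across individuals (n - 1 denominator)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def block_pairs_reduction(n_pair_calculations, block_size):
    """Approximate number of calculations after grouping SNPs into blocks of size b."""
    return n_pair_calculations / block_size ** 2

# Toy data (hypothetical): 4 individuals, 6 SNPs, two blocks of 3 SNPs each.
genotypes = [
    [1, 1, 0, 1, 1, 2],
    [0, 1, 0, 0, 1, 0],
    [2, 0, 0, 0, 0, 2],
    [1, 1, 1, 1, 1, 1],
]
h1 = block_heterozygosity(genotypes, 0, 3)  # heterozygous counts in block 1
h2 = block_heterozygosity(genotypes, 3, 3)  # heterozygous counts in block 2
cov_h = covariance(h1, h2)
reduced = block_pairs_reduction(1e12, 50)
```

With the block size of 50 quoted in the summary, b² = 2500, so roughly 10¹² SNP-pair calculations shrink to about 4 × 10⁸ block-pair calculations.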

Accuracy of marker-assisted selection with single markers and marker haplotypes in cattle

Genetics Research, DOI 10.1017/s0016672307008865, 2007, Vol 89 (4), pp. 215-220, Cited By ~ 49

Author(s): B. J. Hayes, A. J. Chamberlain, H. McPartlan, I. MacLeod, L. Sethuraman, ...

Keyword(s): Linkage Disequilibrium, Quantitative Trait Loci, Single Nucleotide Polymorphisms, Quantitative Trait, Marker Assisted Selection, Nucleotide Polymorphisms, Data Set, Single Nucleotide, Angus Cattle, Genome Wide

Summary: A key question for the implementation of marker-assisted selection (MAS) using markers in linkage disequilibrium with quantitative trait loci (QTLs) is how many markers surrounding each QTL should be used to ensure the marker or marker haplotypes are in sufficient linkage disequilibrium (LD) with the QTL. In this paper we compare the accuracy of MAS using either single markers or marker haplotypes in an Angus cattle data set consisting of 9323 genome-wide single nucleotide polymorphisms (SNPs) genotyped in 379 Angus cattle. The extent of LD in the data set was such that the average marker–marker r² was 0.2 at 200 kb. The accuracy of MAS increased as the number of markers in the haplotype surrounding the QTL increased, although only when the number of markers in the haplotype was 4 or greater did the accuracy exceed that achieved when the SNP in the highest LD with the QTL was used. A large number of phenotypic records (>1000) were required to accurately estimate the effects of the haplotypes.
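For readers unfamiliar with the LD statistic quoted above, the following minimal sketch computes the standard pairwise r² between two biallelic markers from phased haplotypes. The function name and the input format (a list of 0/1 allele pairs) are hypothetical, and this is the textbook definition rather than the authors' own pipeline.

```python
def ld_r_squared(haplotypes):
    """Pairwise r^2 between two biallelic markers.

    haplotypes: list of (a, b) tuples, each allele coded 0 or 1 (assumed input format).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at marker A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at marker B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n  # 1-1 haplotype frequency
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either marker is monomorphic")
    return d * d / denom

# Toy examples (hypothetical data): complete association and no association.
perfect = ld_r_squared([(1, 1), (1, 1), (0, 0), (0, 0)])
independent = ld_r_squared([(1, 1), (1, 0), (0, 1), (0, 0)])
```

An r² of 1 means the alleles at one marker fully predict those at the other, and 0 means no association.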

Construction of relatedness matrices using genotyping-by-sequencing data

DOI 10.1101/025379, 2015, Cited By ~ 2

Author(s): Ken G Dodds, John C McEwan, Rudiger Brauning, Rayna M Anderson, Tracey C van Stijn, ...

Keyword(s): Graphical Method, Genotyping By Sequencing, Real Data, Sequencing Depth, Attractive Alternative, Nucleotide Polymorphisms, Efficient Computation, Sequencing Data, Data Set, Unbiased Estimates

Background: Genotyping-by-sequencing (GBS) is becoming an attractive alternative to array-based methods for genotyping individuals for a large number of single nucleotide polymorphisms (SNPs). Costs can be lowered by reducing the mean sequencing depth, but this results in genotype calls of lower quality. A common analysis strategy is to filter SNPs to just those with sufficient depth, thereby greatly reducing the number of SNPs available. We investigate methods for estimating relatedness using GBS data, including results of low depth, using theoretical calculation, simulation and application to a real data set.

Results: We show that unbiased estimates of relatedness can be obtained by using only those SNPs with genotype calls in both individuals. The expected value of this estimator is independent of the SNP depth in each individual, under a model of genotype calling that includes the special case of the two alleles being read at random. In contrast, the estimator of self-relatedness does depend on the SNP depth, and we provide a modification to provide unbiased estimates of self-relatedness. We refer to these methods of estimation as kinship using GBS with depth adjustment (KGD). The estimators can be calculated using matrix methods, which allow efficient computation. Simulation results were consistent with the methods being unbiased, and suggest that the optimal sequencing depth is around 2-4 for relatedness between individuals and 5-10 for self-relatedness. Application to a real data set revealed that some SNP filtering may still be necessary, for the exclusion of SNPs which did not behave in a Mendelian fashion. A simple graphical method (a ‘fin plot’) is given to illustrate this issue and to guide filtering parameters.

Conclusion: We provide a method which gives unbiased estimates of relatedness, based on SNPs assayed by GBS, which accounts for the depth (including zero depth) of the genotype calls. This allows GBS to be applied at read depths which can be chosen to optimise the information obtained. SNPs with excess heterozygosity, often due to (partial) polyploidy or other duplications, can be filtered based on a simple graphical method.
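To see why read depth matters for self-relatedness, consider the special case mentioned above in which the two alleles are read at random. The sketch below is a minimal illustration, assuming each read independently samples one of the two alleles of a heterozygote with probability 1/2; the function name is hypothetical, and this is not the KGD estimator itself. Under that assumption, the chance that a true heterozygote shows only one allele at depth k is 2 × (1/2)^k.

```python
def prob_het_seen_as_hom(depth):
    """Probability that a true heterozygote shows only one allele at this read depth.

    Assumes each read samples one of the two alleles independently with probability 1/2.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1; at depth 0 there is no genotype call")
    # Either all reads show the first allele or all show the second: 2 * (1/2)^depth.
    return 2 * 0.5 ** depth

# Apparent-homozygote probabilities at a few illustrative depths.
for k in (1, 2, 4, 10):
    print(k, prob_het_seen_as_hom(k))
```

At depth 1 every heterozygote looks homozygous, and the probability halves with each additional read. This is one way to picture why an individual's apparent homozygosity, and hence its self-relatedness, depends on depth.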
