A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes

Abstract Given the economic and environmental importance of allopolyploids and other species with highly duplicated genomes, there is a need for accurate genotyping methodology that distinguishes paralogs in order to yield Mendelian markers. Methods such as comparing observed and expected heterozygosity are frequently used for identifying collapsed paralogs, but have limitations in genotyping-by-sequencing datasets, in which observed heterozygosity is difficult to estimate due to undersampling of alleles. These limitations are especially pronounced when the species is highly heterozygous or the expected inheritance is polysomic. We introduce a novel statistic, H_ind/H_E, that uses the probability of sampling reads of two different alleles at a sample*locus, instead of observed heterozygosity. The expected value of H_ind/H_E is the same across all loci in a dataset, regardless of read depth or allele frequency. In contrast to methods based on observed heterozygosity, it can be estimated and used for filtering loci prior to genotype calling. We also introduce an algorithm that can choose among multiple alignment locations for a given sequence tag in order to optimize the value of H_ind/H_E for each locus, correcting alignment errors that frequently occur in highly duplicated genomes. Our methodology is implemented in polyRAD v1.2, available at https://github.com/lvclark/polyRAD.
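To make the statistic described above concrete, the following is a minimal Python sketch, not polyRAD's own code (polyRAD is an R package). It rests on several assumptions that are not stated in the abstract: that "the probability of sampling reads of two different alleles" means two reads drawn without replacement from one sample at one locus, that allele frequencies are estimated as the mean per-sample read proportion, and that samples with fewer than two reads are skipped. The function names `h_ind`, `expected_heterozygosity` and `hind_he` are hypothetical.

```python
from typing import List, Sequence


def h_ind(depths: Sequence[int]) -> float:
    """Chance that two reads drawn without replacement are different alleles."""
    total = sum(depths)
    if total < 2:
        raise ValueError("need at least two reads at this sample*locus")
    same_pairs = sum(d * (d - 1) for d in depths)
    return 1.0 - same_pairs / (total * (total - 1))


def expected_heterozygosity(freqs: Sequence[float]) -> float:
    """H_E = 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def hind_he(locus_depths: Sequence[Sequence[int]]) -> float:
    """Mean H_ind over samples divided by H_E, for a single locus.

    locus_depths[i][a] is the read depth of allele a in sample i.
    """
    # Assumption: samples with fewer than two reads carry no information.
    usable = [row for row in locus_depths if sum(row) >= 2]
    if not usable:
        raise ValueError("no sample has two or more reads")
    n_alleles = len(usable[0])
    # Assumed estimator: average each sample's read proportions.
    freqs: List[float] = [
        sum(row[a] / sum(row) for row in usable) / len(usable)
        for a in range(n_alleles)
    ]
    he = expected_heterozygosity(freqs)
    if he == 0.0:
        raise ValueError("locus is monomorphic; the ratio is undefined")
    mean_h_ind = sum(h_ind(row) for row in usable) / len(usable)
    return mean_h_ind / he
```

For example, `hind_he([[5, 5], [10, 0], [0, 10]])` computes the ratio for one biallelic locus genotyped in three samples. Because the statistic is built from read counts alone, a filter of this kind can run before any genotypes are called, which is the property the abstract emphasizes.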

polyRAD: Genotype calling with uncertainty from sequencing data in polyploids and diploids

10.1101/380899 ◽ 2018 ◽ Cited By ~ 4

Author(s): Lindsay V. Clark ◽ Alexander E. Lipka ◽ Erik J. Sacks

Keyword(s): Genotyping By Sequencing ◽ Read Depth ◽ R Package ◽ Structured Populations ◽ Sequencing Data ◽ Genotype Calling ◽ Genome Wide ◽ Data Rates ◽ Hardy Weinberg Equilibrium ◽ Mapping Populations

Abstract Low or uneven read depth is a common limitation of genotyping-by-sequencing (GBS) and restriction site-associated DNA sequencing (RAD-seq), resulting in high missing data rates, heterozygotes miscalled as homozygotes, and uncertainty of allele copy number in heterozygous polyploids. Bayesian genotype calling can mitigate these issues, but previously has only been implemented in software that requires a reference genome or uses priors that may be inappropriate for the population. Here we present several novel Bayesian algorithms that estimate genotype posterior probabilities, all of which are implemented in a new R package, polyRAD. Appropriate priors can be specified for mapping populations, populations in Hardy-Weinberg equilibrium, or structured populations, and in each case can be informed by genotypes at linked markers. The polyRAD software imports read depth from several existing pipelines, and outputs continuous or discrete numerical genotypes suitable for analyses such as genome-wide association and genomic prediction.
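The general idea of Bayesian genotype calling from read depth can be illustrated with a short Python sketch. This is a generic textbook-style model, not polyRAD's algorithm: it assumes a biallelic locus, a Hardy-Weinberg (binomial) prior on allele dosage, a binomial read-sampling likelihood, and a single symmetric sequencing error rate. The function names and the default `error` value are hypothetical choices made for illustration.

```python
from math import comb
from typing import List


def genotype_posteriors(ref_reads: int, alt_reads: int, alt_freq: float,
                        ploidy: int = 2, error: float = 0.001) -> List[float]:
    """Posterior probability of each alt-allele dosage 0..ploidy."""
    n = ref_reads + alt_reads
    weights = []
    for g in range(ploidy + 1):
        # Assumed prior: Hardy-Weinberg equilibrium, dosage ~ Binomial(ploidy, alt_freq).
        prior = comb(ploidy, g) * alt_freq ** g * (1 - alt_freq) ** (ploidy - g)
        # Chance that one read shows the alt allele, allowing for miscalled bases.
        p_alt = (g / ploidy) * (1 - error) + (1 - g / ploidy) * error
        likelihood = comb(n, alt_reads) * p_alt ** alt_reads * (1 - p_alt) ** ref_reads
        weights.append(prior * likelihood)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("reads are impossible under this prior and error rate")
    return [w / total for w in weights]


def posterior_mean_dosage(posteriors: List[float]) -> float:
    """Continuous numerical genotype: the expected alt-allele copy number."""
    return sum(g * p for g, p in enumerate(posteriors))
```

Calling `genotype_posteriors(3, 0, 0.5)` for a diploid with three reference reads and no alternative reads leaves a non-trivial posterior probability on the heterozygote, which is the low-depth miscalling problem the abstract describes. `posterior_mean_dosage` turns the posterior into a continuous genotype, while taking the most probable dosage would give a discrete one.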

Characterisation of microsatellite and SNP markers from Miseq and genotyping-by-sequencing data among parapatric Urophora cardui (Tephritidae) populations

PeerJ ◽ 10.7717/peerj.3582 ◽ 2017 ◽ Vol 5 ◽ pp. e3582 ◽ Cited By ~ 2

Author(s): Jes Johannesen ◽ Armin G. Fabritzek ◽ Bettina Ebner ◽ Sven-Ernö Bikar

Keyword(s): Genotyping By Sequencing ◽ Extraction Methods ◽ Snp Markers ◽ Small Scale ◽ Sequencing Data ◽ Allele Number ◽ Expected Heterozygosity ◽ Genome Wide ◽ Western Palearctic ◽ Dna Extraction Methods

Phylogeographic analyses of the gall fly Urophora cardui have in earlier studies based on allozymes and mtDNA identified small-scale, parapatrically diverged populations within an expanding Western Palearctic population. However, the low polymorphism of these markers prohibited an accurate delimitation of the evolutionary origin of the parapatric divergence. Urophora cardui from the Western Palearctic have been introduced into Canada as biological control agents of the host plant Cirsium arvense. Here, we characterise 12 microsatellite loci with hexa-, penta- and tetra-nucleotide repeat motifs and report a genotyping-by-sequencing SNP protocol. We test the markers for genetic variation among three parapatric U. cardui populations. Microsatellite variability (N = 59 individuals) was high: expected heterozygosity/locus/population (0.60–0.90), allele number/locus/population (5–21). One locus was alternatively sex-linked in males or females. Cross-species amplification in the sister species U. stylata was successful or partially successful for seven loci. For genotyping-by-sequencing (N = 18 individuals), different DNA extraction methods did not affect data quality. Depending on sequence sorting criteria, 1,177–2,347 unlinked SNPs and 1,750–4,469 parsimony informative sites were found in 3,514–5,767 loci recovered after paralog filtering. Both marker systems quantified the same population partitions with high probabilities. Many and highly differentiated loci in both marker systems indicate genome-wide diversification and genetically distinct populations.

The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny

Scientific Reports ◽ 10.1038/srep19427 ◽ 2016 ◽ Vol 6 (1) ◽ Cited By ~ 48

Author(s): Davide Scaglione ◽ Sebastian Reyes-Chin-Wo ◽ Alberto Acquadro ◽ Lutz Froenicke ◽ Ezio Portis ◽ ...

Keyword(s): Genome Sequence ◽ De Novo ◽ Genotyping By Sequencing ◽ Read Depth ◽ Globe Artichoke ◽ Sequencing Data ◽ Crop Species ◽ Low Pass ◽ Sequencing Strategy ◽ Low Coverage

Abstract Globe artichoke (Cynara cardunculus var. scolymus) is an out-crossing, perennial, multi-use crop species that is grown worldwide and belongs to the Compositae, one of the most successful Angiosperm families. We describe the first genome sequence of globe artichoke. The assembly, comprising 13,588 scaffolds covering 725 of the 1,084 Mb genome, was generated using ~133-fold Illumina sequencing data and encodes 26,889 predicted genes. Re-sequencing (30×) of globe artichoke and cultivated cardoon (C. cardunculus var. altilis) parental genotypes and low-coverage (0.5 to 1×) genotyping-by-sequencing of 163 F1 individuals resulted in 73% of the assembled genome being anchored in 2,178 genetic bins ordered along 17 chromosomal pseudomolecules. This was achieved using a novel pipeline, SOILoCo (Scaffold Ordering by Imputation with Low Coverage), to detect heterozygous regions and assign parental haplotypes with low sequencing read depth and of unknown phase. SOILoCo provides a powerful tool for de novo genome analysis of outcrossing species. Our data will enable genome-scale analyses of evolutionary processes among crops, weeds and wild species within and beyond the Compositae and will facilitate the identification of economically important genes from related species.

Parentage Analysis in Giant Grouper (Epinephelus lanceolatus) Using Microsatellite and SNP Markers from Genotyping-by-Sequencing Data

Genes ◽ 10.3390/genes12071042 ◽ 2021 ◽ Vol 12 (7) ◽ pp. 1042

Author(s): Zhuoying Weng ◽ Yang Yang ◽ Xi Wang ◽ Lina Wu ◽ Sijie Hua ◽ ...

Keyword(s): Fishery Management ◽ Genotyping By Sequencing ◽ Parentage Analysis ◽ Snp Markers ◽ Individual Identification ◽ Pedigree Information ◽ Nucleotide Polymorphisms ◽ Sequencing Data ◽ Polymorphic Snps ◽ Mixed Family

Pedigree information is necessary for the maintenance of diversity for wild and captive populations. An accurate pedigree is determined by molecular marker-based parentage analysis, which may be influenced by the polymorphism and number of markers, integrity of samples, relatedness of parents, or different analysis programs. Here, we describe the first development of 208 single nucleotide polymorphisms (SNPs) and 11 microsatellites for giant grouper (Epinephelus lanceolatus) taking advantage of genotyping-by-sequencing (GBS), and compare the power of SNPs and microsatellites for parentage and relatedness analysis, based on a mixed family composed of 4 candidate females, 4 candidate males and 289 offspring. CERVUS, PAPA and COLONY were used for mutual verification. We found that SNPs had a better potential for relatedness estimation, exclusion of non-parentage and individual identification than microsatellites, and > 98% accuracy of parentage assignment could be achieved by 100 polymorphic SNPs (MAF cut-off < 0.4) or 10 polymorphic microsatellites (mean Ho = 0.821, mean PIC = 0.651). This study provides a reference for the development of molecular markers for parentage analysis taking advantage of next-generation sequencing, and contributes to molecular breeding, fishery management and population conservation.

Chloroplast genomes in Populus (Salicaceae): comparisons from an intensively sampled genus reveal dynamic patterns of evolution

Scientific Reports ◽ 10.1038/s41598-021-88160-4 ◽ 2021 ◽ Vol 11 (1)

Author(s): Jiawei Zhou ◽ Shuo Zhang ◽ Jie Wang ◽ Hongmei Shen ◽ Bin Ai ◽ ...

Keyword(s): Phylogenetic Analyses ◽ Population Level ◽ Gene Content ◽ Sequencing Data ◽ Cellular Functions ◽ Chloroplast Evolution ◽ Chloroplast Genomes ◽ Genome Features ◽ Genome Annotations ◽ Genome Analyses

Abstract The chloroplast is one of two organelles containing a separate genome that codes for essential and distinct cellular functions such as photosynthesis. Given the importance of chloroplasts in plant metabolism, the genomic architecture and gene content have been strongly conserved through long periods of time and as such are useful molecular tools for evolutionary inferences. At present, complete chloroplast genomes from over 4000 species have been deposited into publicly accessible databases. Despite the large number of complete chloroplast genomes, comprehensive analyses regarding genome architecture and gene content have not been conducted for many lineages with complete species sampling. In this study, we employed the genus Populus to assess how more comprehensively sampled chloroplast genome analyses can be used in understanding chloroplast evolution in a broadly studied lineage of angiosperms. We conducted comparative analyses across Populus in order to elucidate variation in key genome features such as genome size, gene number, gene content, repeat type and number, SSR (Simple Sequence Repeat) abundance, and boundary positioning between the four main units of the genome. We found that some genome annotations were variable across the genus, owing in part to errors in assembly or data checking, and we provide corrected annotations. We also employed complete chloroplast genomes for phylogenetic analyses including the dating of divergence times throughout the genus. Lastly, we utilized re-sequencing data to describe the variations of pan-chloroplast genomes at the population level for P. euphratica. The analyses used in this paper provide a blueprint for the types of analyses that can be conducted with publicly available chloroplast genomes as well as methods for building upon existing datasets to improve evolutionary inference.

Single Nucleotide Polymorphism Discovery and Genetic Differentiation Analysis of Geese Bred in Poland, Using Genotyping-by-Sequencing (GBS)

Genes ◽ 10.3390/genes12071074 ◽ 2021 ◽ Vol 12 (7) ◽ pp. 1074

Author(s): Joanna Grzegorczyk ◽ Artur Gurgul ◽ Maria Oczkowicz ◽ Tomasz Szmatoła ◽ Agnieszka Fornal ◽ ...

Keyword(s): Genotyping By Sequencing ◽ Read Depth ◽ Model Organisms ◽ Single Nucleotide Polymorphism Discovery ◽ Nucleotide Polymorphisms ◽ Single Nucleotide ◽ Polymorphism Discovery ◽ Genome Wide ◽ Plumage Development ◽ Edar Gene

Poland is the largest European producer of goose, and goose breeding has become an essential and still growing branch of the poultry industry. The most frequently bred goose is the White Kołuda® breed, constituting 95% of the country’s population, whereas geese of regional varieties are bred in smaller conservation flocks. However, goose genetic diversity remains poorly explored, mainly because the most commonly used tools are of limited use in non-model organisms. Single nucleotide polymorphisms (SNPs) are among the most accurate markers used in population genetics. A highly efficient strategy for genome-wide SNP detection is genotyping-by-sequencing (GBS), which has already been widely applied in many organisms. This study attempts to use GBS in 12 conservation goose breeds and the White Kołuda® breed maintained in Poland. The GBS method allowed for the detection of 3833 common raw SNPs. After filtering for read depth and allele characteristics, the final marker panel used for the differentiation analysis comprised 791 SNPs. These variants were located within 11 different genes, and one of the most diversified variants was associated with the EDAR gene, which is especially interesting because it participates in plumage development, which plays a crucial role in goose breeding.

PhredEM: A Phred-Score-Informed Genotype-Calling Approach for Next-Generation Sequencing Studies

10.1101/046136 ◽ 2016

Author(s): Peizhou Liao ◽ Glen A. Satten ◽ Yi-juan Hu

Keyword(s): Logistic Regression ◽ Next Generation Sequencing ◽ Em Algorithm ◽ Error Rates ◽ Next Generation Sequencing Data ◽ Next Generation ◽ Sequencing Data ◽ Genotype Calling ◽ Sequencing Studies ◽ Generation Sequencing

Abstract A fundamental challenge in analyzing next-generation sequencing data is to determine an individual’s genotype correctly as the accuracy of the inferred genotype is essential to downstream analyses. Some genotype callers, such as GATK and SAMtools, directly calculate the base-calling error rates from phred scores or recalibrated base quality scores. Others, such as SeqEM, estimate error rates from the read data without using any quality scores. It is also a common quality control procedure to filter out reads with low phred scores. However, choosing an appropriate phred score threshold is problematic as a too-high threshold may lose data while a too-low threshold may introduce errors. We propose a new likelihood-based genotype-calling approach that exploits all reads and estimates the per-base error rates by incorporating phred scores through a logistic regression model. The algorithm, which we call PhredEM, uses the Expectation-Maximization (EM) algorithm to obtain consistent estimates of genotype frequencies and logistic regression parameters. We also develop a simple, computationally efficient screening algorithm to identify loci that are estimated to be monomorphic, so that only loci estimated to be non-monomorphic require application of the EM algorithm. We evaluate the performance of PhredEM using both simulated data and real sequencing data from the UK10K project. The results demonstrate that PhredEM is an improved, robust and widely applicable genotype-calling approach for next-generation sequencing studies. The relevant software is freely available.
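The direct conversion between phred scores and error rates mentioned above follows the standard definition Q = -10 log10(e). The Python sketch below shows only that conversion and a simple threshold filter; it does not implement PhredEM's logistic regression or EM steps, and the function names are hypothetical.

```python
from math import log10
from typing import List


def phred_to_error(q: float) -> float:
    """Base-calling error probability implied by phred score q."""
    return 10 ** (-q / 10)


def error_to_phred(e: float) -> float:
    """Phred score corresponding to error probability e (0 < e <= 1)."""
    if not 0.0 < e <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * log10(e)


def keep_fraction(quals: List[int], threshold: int) -> float:
    """Fraction of base calls retained by a hard phred threshold."""
    if not quals:
        raise ValueError("no quality scores given")
    return sum(q >= threshold for q in quals) / len(quals)
```

Comparing `keep_fraction(quals, 20)` with `keep_fraction(quals, 30)` on the same list of scores shows the trade-off the abstract points to: a stricter threshold lowers the implied error rate of the retained bases but discards more data.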

AFLAP: Assembly-Free Linkage Analysis Pipeline using k-mers from whole genome sequencing data

10.1101/2020.09.14.296525 ◽ 2020

Author(s): Kyle Fletcher ◽ Lin Zhang ◽ Juliana Gil ◽ Rongkui Han ◽ Keri Cavanaugh ◽ ...

Keyword(s): Linkage Analysis ◽ Whole Genome Sequencing ◽ Genome Sequencing ◽ Genetic Map ◽ Genotyping By Sequencing ◽ Genetic Maps ◽ Whole Genome ◽ Sequencing Data ◽ Analysis Pipeline ◽ Genome Assemblies

Abstract Background Genetic maps are an important resource for validation of genome assemblies, trait discovery, and breeding. Next generation sequencing has enabled production of high-density genetic maps constructed with 10,000s of markers. Most current approaches require a genome assembly to identify markers. Our Assembly Free Linkage Analysis Pipeline (AFLAP) removes this requirement by using uniquely segregating k-mers as markers to rapidly construct a genotype table and perform subsequent linkage analysis. This avoids potential biases including preferential read alignment and variant calling. Results The performance of AFLAP was determined in simulations and contrasted to a conventional workflow. We tested AFLAP using 100 F2 individuals of Arabidopsis thaliana, sequenced to low coverage. Genetic maps generated using k-mers contained over 130,000 markers that were concordant with the genomic assembly. The utility of AFLAP was then demonstrated by generating an accurate genetic map using genotyping-by-sequencing data of 235 recombinant inbred lines of Lactuca spp. AFLAP was then applied to 83 F1 individuals of the oomycete Bremia lactucae, sequenced to >5x coverage. The genetic map contained over 90,000 markers ordered in 19 large linkage groups. This genetic map was used to fragment, order, orient, and scaffold the genome, resulting in a much-improved reference assembly. Conclusions AFLAP can be used to generate high density linkage maps and improve genome assemblies of any organism when a mapping population is available using whole genome sequencing or genotyping-by-sequencing data. Genetic maps produced for B. lactucae were accurately aligned to the genome and guided significant improvements of the reference assembly.

CNV-P: a machine-learning framework for predicting high confident copy number variations

PeerJ ◽ 10.7717/peerj.12564 ◽ 2021 ◽ Vol 9 ◽ pp. e12564

Author(s): Taifu Wang ◽ Jinghua Sun ◽ Xiuqing Zhang ◽ Wen-Jing Wang ◽ Qing Zhou

Keyword(s): Machine Learning ◽ False Positive ◽ Copy Number ◽ Genetic Disorders ◽ Genetic Diseases ◽ Basic Research ◽ Read Depth ◽ Copy Number Variations ◽ Sequencing Data ◽ Learning Framework

Background Copy-number variants (CNVs) have been recognized as one of the major causes of genetic disorders. Reliable detection of CNVs from genome sequencing data is in strong demand for disease research. However, current software for detecting CNVs has high false-positive rates and needs further improvement. Methods Here, we propose a novel post-processing approach for CNV prediction (CNV-P), a machine-learning framework that can efficiently remove false-positive fragments from the results of CNV-detection tools. A series of CNV signals such as read depth (RD), split reads (SR) and read pair (RP) around the putative CNV fragments were defined as features to train a classifier. Results The prediction results on several real biological datasets showed that our models could accurately classify the CNVs at over 90% precision rate and 85% recall rate, which greatly improves the performance of state-of-the-art algorithms. Furthermore, our results indicate that CNV-P is robust to different sizes of CNVs and the platforms of sequencing. Conclusions Our framework for classifying high-confidence CNVs could improve both basic research and clinical diagnosis of genetic diseases.

Ancient introgression between distantly related white oaks (Quercus sect Quercus) shows evidence of climate-associated asymmetric gene exchange

Journal of Heredity ◽ 10.1093/jhered/esab053 ◽ 2021

Author(s): Scott T O’Donnell ◽ Sorel T Fitz-Gibbon ◽ Victoria L Sork

Keyword(s): Gene Flow ◽ Genotyping By Sequencing ◽ Nucleotide Polymorphisms ◽ Sequencing Data ◽ Single Nucleotide ◽ Scrub Oak ◽ Population Genetic Inference ◽ Genetic Inference ◽ California Floristic Province ◽ Python Package

Abstract Ancient introgression can be an important source of genetic variation that shapes the evolution and diversification of many taxa. Here, we estimate the timing, direction and extent of gene flow between two distantly related oak species in the same section (Quercus sect. Quercus). We estimated these demographic events using genotyping by sequencing data (GBS), which generated 25,702 single nucleotide polymorphisms (SNPs) for 24 individuals of California scrub oak (Quercus berberidifolia) and 23 individuals of Engelmann oak (Q. engelmannii). We tested several scenarios involving gene flow between these species using the diffusion approximation-based population genetic inference framework and model-testing approach of the Python package DaDi. We found that the most likely demographic scenario includes a bottleneck in Q. engelmannii that coincides with asymmetric gene flow from Q. berberidifolia into Q. engelmannii. Given that the timing of this gene flow coincides with the advent of a Mediterranean-type climate in the California Floristic Province, we propose that changing precipitation patterns and seasonality may have favored the introgression of climate-associated genes from the endemic into the non-endemic California oak.
