A hidden Markov-model for gene mapping based on whole-genome next generation sequencing data


Benchmarking variant identification tools for plant diversity discovery

BMC Genomics ◽

10.1186/s12864-019-6057-7 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 7

Author(s):

Xing Wu ◽

Christopher Heffelfinger ◽

Hongyu Zhao ◽

Stephen L. Dellaporta

Keyword(s):

Next Generation Sequencing ◽

High Throughput Sequencing ◽

Crop Improvement ◽

Variant Calling ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Variant Discovery ◽

Variant Filtering ◽

Generation Sequencing

Background: The ability to accurately and comprehensively identify genomic variations is critical for plant studies utilizing high-throughput sequencing. Most bioinformatics tools for processing next-generation sequencing data were originally developed and tested in human studies, raising questions as to their efficacy for plant research. A detailed evaluation of the entire variant calling pipeline, including alignment, variant calling, variant filtering, and imputation, was performed on different programs using both simulated and real plant genomic datasets. Results: A comparison of SOAP2, Bowtie2, and BWA-MEM found that BWA-MEM was consistently able to align the most reads with high accuracy, whereas Bowtie2 had the highest overall accuracy. Comparative results of GATK HaplotypeCaller versus SAMtools mpileup indicated that the choice of variant caller affected precision and recall differentially depending on the levels of diversity, sequence coverage, and genome complexity. A cross-reference experiment of S. lycopersicum and S. pennellii reference genomes revealed the inadequacy of a single reference genome for variant discovery that includes distantly related plant individuals. A machine-learning-based variant filtering strategy outperformed the traditional hard-cutoff strategy, resulting in a higher number of true positive variants and fewer false positive variants. A 2-step imputation method, which utilized a set of high-confidence SNPs as the reference panel, showed up to 60% higher accuracy than direct LD-based imputation. Conclusions: Programs in the variant discovery pipeline perform differently on plant genomic datasets. The choice of programs depends on the goal of the study and the available resources. This study provides important guidance for plant biologists utilizing next-generation sequencing data for diversity characterization and crop improvement.


Automated genotyping of microsatellite loci from feces with high throughput sequences

PLoS ONE ◽

10.1371/journal.pone.0258906 ◽

2021 ◽

Vol 16 (10) ◽

pp. e0258906

Author(s):

Isabel Salado ◽

Alberto Fernández-Gil ◽

Carles Vilà ◽

Jennifer A. Leonard

Keyword(s):

Next Generation Sequencing ◽

High Throughput ◽

Microsatellite Loci ◽

High Throughput Sequencing ◽

Individual Identification ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Software Packages ◽

Generation Sequencing

Ecological and conservation genetic studies often use noninvasive sampling, especially with elusive or endangered species. Because microsatellites are generally short, they can be amplified from low-quality samples such as feces. Microsatellites are highly polymorphic, so a few markers are enough for reliable individual identification, kinship determination, or population characterization. However, genotyping from feces is expensive and time-consuming. Given next-generation sequencing (NGS) and recent software developments, automated microsatellite genotyping from NGS data may now be possible. These software packages infer genotypes directly from sequence reads, increasing throughput. Here we evaluate the performance of four software packages for genotyping microsatellite loci from Iberian wolf (Canis lupus) feces using NGS. We initially combined 46 markers in a single multiplex reaction for the first time, of which 19 were included in the final analyses. Megasat was the software that provided genotypes with the fewest errors. Coverage over 100X provided little additional information, but a relatively high number of PCR replicates was necessary to obtain a high-quality genotype from highly unoptimized, multiplexed reactions (10 replicates for 18 of the 19 loci analyzed here). This number could be reduced through optimization. The use of new bioinformatic tools and next-generation sequencing data to genotype these highly informative markers may increase throughput at a reasonable cost and with less laboratory work. Thus, high-throughput sequencing approaches could facilitate the use of microsatellites with fecal DNA to address ecological and conservation questions.


Benchmarking Variant Identification Tools for Plant Diversity Discovery

10.21203/rs.2.9666/v1 ◽

2019 ◽

Author(s):

Xing Wu ◽

Christopher Heffelfinger ◽

Hongyu Zhao ◽

Stephen L. Dellaporta

Keyword(s):

Next Generation Sequencing ◽

High Throughput Sequencing ◽

Crop Improvement ◽

Variant Calling ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Variant Discovery ◽

Variant Filtering ◽

Generation Sequencing

Background: The ability to accurately and comprehensively identify genomic variations is critical for plant studies utilizing high-throughput sequencing. Most bioinformatics tools for processing next-generation sequencing data were originally developed and tested in human studies, raising questions as to their efficacy for plant research. A detailed evaluation of the entire variant calling pipeline, including alignment, variant calling, variant filtering, and imputation, was performed on different programs using both simulated and real plant genomic datasets. Results: A comparison of SOAP2, Bowtie2, and BWA-MEM found that BWA-MEM was consistently able to align the most reads with high accuracy, whereas Bowtie2 had the highest overall accuracy. Comparative results of GATK HaplotypeCaller versus SAMtools mpileup indicated that the choice of variant caller affected precision and recall differentially depending on the levels of diversity, sequence coverage, and genome complexity. A cross-reference experiment of S. lycopersicum and S. pennellii reference genomes revealed the inadequacy of a single reference genome for variant discovery that includes distantly related plant individuals. A machine-learning-based variant filtering strategy outperformed the traditional hard-cutoff strategy, resulting in a higher number of true positive variants and fewer false positive variants. The 2-step imputation method, which utilized a set of high-confidence SNPs as the reference panel, showed up to 60% higher accuracy than the direct LD-based imputation method. Conclusions: Programs in the variant discovery pipeline perform differently on plant genomic datasets. The choice of programs depends on the goal of the study and the available resources. This study provides important guidance for plant biologists utilizing next-generation sequencing data for diversity characterization and crop improvement.


Faculty Opinions recommendation of VarWalker: personalized mutation network analysis of putative cancer genes from next-generation sequencing data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.718272765.793499663 ◽

2014 ◽

Author(s):

Gary Bader ◽

Mohamed Helmy

Keyword(s):

Next Generation Sequencing ◽

Network Analysis ◽

Next Generation Sequencing Data ◽

Cancer Genes ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing


Faculty Opinions recommendation of Bioinformatory-assisted analysis of next-generation sequencing data for precision medicine in pancreatic cancer.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.727775566.793536095 ◽

2017 ◽

Author(s):

Steve Pereira

Keyword(s):

Pancreatic Cancer ◽

Next Generation Sequencing ◽

Precision Medicine ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Assisted Analysis ◽

Generation Sequencing


NGSremix: A software tool for estimating pairwise relatedness between admixed individuals from next-generation sequencing data

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab174 ◽

2021 ◽

Author(s):

Anne Krogh Nøhr ◽

Kristian Hanghøj ◽

Genis Garcia Erill ◽

Zilong Li ◽

Ida Moltke ◽

...

Keyword(s):

Next Generation Sequencing ◽

Genetic Research ◽

Likelihood Estimation ◽

Software Tool ◽

Estimation Methods ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Ngs Data ◽

Generation Sequencing

Estimation of relatedness between pairs of individuals is important in many genetic research areas. When estimating relatedness, it is important to account for admixture if it is present. However, the methods that can account for admixture all take genotype data as input, which is a problem for low-depth next-generation sequencing (NGS) data, from which genotypes are called with high uncertainty. Here we present a software tool, NGSremix, for maximum likelihood estimation of relatedness between pairs of admixed individuals from low-depth NGS data, which takes the uncertainty of the genotypes into account via genotype likelihoods. Using both simulated and real NGS data for admixed individuals with an average depth of 4x or below, we show that our method works well and clearly outperforms the commonly used state-of-the-art relatedness estimation methods PLINK, KING, relateAdmix, and ngsRelate, all of which perform quite poorly. Hence, NGSremix is a useful new tool for estimating relatedness in admixed populations from low-depth NGS data. NGSremix is implemented in C/C++ as multi-threaded software and is freely available on GitHub: https://github.com/KHanghoj/NGSremix.
