Simulation of African and non-African low and high coverage whole genome sequence data to assess variant calling approaches

Abstract Current variant calling (VC) approaches have been designed to leverage populations of long-range haplotypes and were benchmarked using populations of European descent, whereas most genetic diversity is found in non-European populations, such as those of Africa. When applied to these genetically diverse populations, VC tools may produce false positive and false negative results, which may lead to misleading conclusions in the prioritization of mutations and in assessing the clinical relevance and actionability of genes. The key question is which tool or pipeline achieves high sensitivity and precision when analysing African data with either low or high sequence coverage, given the high genetic diversity and heterogeneity of these data. Here, a total of 100 synthetic Whole Genome Sequencing (WGS) samples, mimicking the genetic profiles of African and European subjects at different coverage levels (high/low), were generated to assess the performance of nine different VC tools on these contrasting datasets. The performance of these tools was assessed in terms of false positive and false negative call rates by comparing the simulated golden variants to the variants identified by each VC tool. Combining our results on sensitivity and positive predictive value (PPV), VarDict [PPV = 0.999 and Matthews correlation coefficient (MCC) = 0.832] and BCFtools (PPV = 0.999 and MCC = 0.813) perform best on African population data at both high and low coverage. Overall, current VC tools produce higher false positive and false negative rates when analysing African compared with European data. This highlights the need for development of VC approaches with high sensitivity and precision tailored for populations characterized by high genetic variation and low linkage disequilibrium.
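The metrics named in this abstract (sensitivity, PPV and MCC) can be illustrated with a minimal Python sketch that compares a simulated truth set with a caller's output. This is not the authors' pipeline: the function names, the representation of a variant as a `(chrom, pos, ref, alt)` tuple, and the use of a total site count `n_sites` to derive true negatives are all assumptions made for illustration, since the abstract does not state how true negatives were defined.

```python
from math import sqrt


def confusion_counts(truth, calls, n_sites):
    """Count TP, FP, FN, TN for one caller.

    truth, calls: sets of (chrom, pos, ref, alt) tuples (hypothetical encoding).
    n_sites: total number of evaluated sites (an assumption; needed only for TN).
    """
    tp = len(truth & calls)   # variants present in both sets
    fp = len(calls - truth)   # called but not in the truth set
    fn = len(truth - calls)   # in the truth set but missed
    tn = n_sites - tp - fp - fn
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Return sensitivity, PPV and Matthews correlation coefficient."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "ppv": ppv, "mcc": mcc}


if __name__ == "__main__":
    truth = {("1", 100, "A", "G"), ("1", 200, "C", "T"),
             ("1", 300, "G", "A"), ("1", 400, "T", "C")}
    calls = {("1", 100, "A", "G"), ("1", 200, "C", "T"),
             ("1", 300, "G", "A"), ("1", 500, "A", "C")}
    print(metrics(*confusion_counts(truth, calls, n_sites=1000)))
```

Because true negatives vastly outnumber variants in a whole genome, MCC computed this way depends on how `n_sites` is chosen, so values are comparable only between callers evaluated on the same site set.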

Population genomics of East Asian ethnic groups

Hereditas ◽

10.1186/s41065-020-00162-w ◽

2020 ◽

Vol 157 (1) ◽

Author(s):

Ziqing Pan ◽

Shuhua Xu

Keyword(s):

Genetic Diversity ◽

East Asia ◽

Ethnic Groups ◽

Population Genomics ◽

Sequence Data ◽

East Asian ◽

Whole Genome Sequence ◽

Whole Genome ◽

Evolutionary Forces ◽

Asian Populations

Abstract East Asia constitutes one-fifth of the global population and exhibits substantial genetic diversity. However, populations in this region have been largely under-represented in genetic investigations compared with European populations. Nonetheless, the last decade has seen considerable efforts and progress in genome-wide genotyping and whole-genome sequencing of the East-Asian ethnic groups. Here, we review the recent studies in terms of ancestral origin, population relationship, genetic differentiation, and admixture of major East-Asian groups, such as the Chinese, Korean, and Japanese populations. We mainly focus on insights from the whole-genome sequence data and also include the recent progress based on mitochondrial DNA (mtDNA) and Y chromosome data. We further discuss the evolutionary forces driving genetic diversity in East-Asian populations, and provide our perspectives for future directions on population genetics studies, particularly on under-represented indigenous groups in East Asia.

Homogeneity of Arabian Peninsula dromedary camel populations with signals of geographic distinction based on whole genome sequence data

Scientific Reports ◽

10.1038/s41598-021-04087-w ◽

2022 ◽

Vol 12 (1) ◽

Author(s):

Hussain Bahbahani ◽

Faisal Almathen

Keyword(s):

Genetic Diversity ◽

Genome Sequence ◽

Arabian Peninsula ◽

Sequence Data ◽

Whole Genome Sequence ◽

Whole Genome ◽

Dromedary Camels ◽

Genome Sequence Data ◽

Similarity Indices ◽

Bactrian Camels

Abstract Dromedary camels in the Arabian Peninsula are distributed across different geographical and ecological locations, e.g. desert, mountains and coasts. Here, we explore the whole genome sequence data of ten dromedary populations from the Arabian Peninsula to assess their genetic structure, admixture levels, diversity and similarity indices. Upon including reference dromedary and Bactrian camel populations from Iran and Kazakhstan, we characterise inter-species and geographic genetic distinction between the dromedary and the Bactrian camels. Individual-based alpha genetic diversity profiles are found to be generally higher in Bactrian camels than in dromedary populations, with the exception of five autosomes (NC_044525.1, NC_044534.1, NC_044540.1, NC_044542.1, NC_044544.1) at diversity orders (q ≥ 2). The Arabian Peninsula camels are generally homogeneous, with a small degree of genetic distinction correlating with three geographic groups: North, Central and West; Southwest; and Southeast of the Arabian Peninsula. No significant variation in diversity or similarity indices is observed among the different Arabian Peninsula dromedary populations. This study contributes to our understanding of the genetic diversity of Arabian Peninsula dromedary camels. It will help conserve the genetic stock of this species and support the design of breeding programmes for genetic improvement of favorable traits.
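The phrase "diversity orders (q ≥ 2)" suggests a diversity profile parameterised by an order q, as in Hill numbers. The abstract does not state which index the authors used, so the sketch below is an assumption: it computes the standard Hill number of order q from a list of frequencies, with `hill_number` a hypothetical helper name.

```python
import math


def hill_number(freqs, q):
    """Hill number (effective number of types) of order q.

    freqs: non-negative abundances or frequencies; they are normalised here.
    q = 0 gives richness, q = 1 the exponential of Shannon entropy,
    q = 2 the inverse Simpson index.
    """
    positive = [f for f in freqs if f > 0]
    total = sum(positive)
    p = [f / total for f in positive]
    if abs(q - 1.0) < 1e-12:
        # The general formula is undefined at q = 1; use its limit.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


if __name__ == "__main__":
    # A profile is the same statistic evaluated over a range of orders.
    for q in (0, 1, 2, 3):
        print(q, round(hill_number([0.5, 0.3, 0.2], q), 4))
```

Higher orders weight common types more heavily, which is one reason two populations can rank differently at low and high q.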

Development and Validation of Polymorphic Microsatellite Loci for the NA2 Lineage of Phytophthora ramorum from Whole Genome Sequence Data

Plant Disease ◽

10.1094/pdis-11-16-1586-re ◽

2017 ◽

Vol 101 (5) ◽

pp. 666-673 ◽

Cited By ~ 10

Author(s):

Marie-Claude Gagnon ◽

Nicolas Feau ◽

Angela L. Dale ◽

Braham Dhillon ◽

Richard C. Hamelin ◽

...

Keyword(s):

Genetic Diversity ◽

Population Structure ◽

Genome Sequence ◽

Sequence Data ◽

Ornamental Plants ◽

Phytophthora Ramorum ◽

Whole Genome Sequence ◽

Whole Genome ◽

Genome Sequence Data ◽

And Migration

Phytophthora ramorum is the causal agent of sudden oak death and sudden larch death, and is also responsible for causing ramorum blight on woody ornamental plants. Many microsatellite markers are available to characterize the genetic diversity and population structure of P. ramorum. However, only two markers are polymorphic in the NA2 lineage, which is predominant in Canadian nurseries. Microsatellite motifs were mined from whole-genome sequence data of six P. ramorum NA2 isolates. Of the 43 microsatellite primer pairs selected, 13 loci displayed different allele sizes among the four P. ramorum lineages, 10 loci displayed intralineage variation in the EU1, EU2, and/or NA1 lineages, and 12 microsatellites displayed polymorphism in the NA2 lineage. Genotyping of 272 P. ramorum NA2 isolates collected in nurseries in British Columbia, Canada, from 2004 to 2013 revealed 12 multilocus genotypes (MLGs). One MLG was dominant when examined over time and across sampling locations, and only a few mutations separated the 12 MLGs. The NA2 population observed in Canadian nurseries also showed no signs of sexual recombination, similar to what has been observed in previous studies. The markers developed in this study can be used to assess P. ramorum inter- and intralineage genetic diversity and generate a better understanding of the population structure and migration patterns of this important plant pathogen, especially for the lesser-characterized NA2 lineage.

Whole-genome sequence data uncover loss of genetic diversity due to selection

Genetics Selection Evolution ◽

10.1186/s12711-016-0210-4 ◽

2016 ◽

Vol 48 (1) ◽

Cited By ~ 20

Author(s):

Sonia E. Eynard ◽

Jack J. Windig ◽

Sipke J. Hiemstra ◽

Mario P. L. Calus

Keyword(s):

Genetic Diversity ◽

Genome Sequence ◽

Sequence Data ◽

Whole Genome Sequence ◽

Whole Genome ◽

Genome Sequence Data ◽

Loss Of Genetic Diversity

Emergence of a Plant Pathogen in Europe Associated with Multiple Intercontinental Introductions

Applied and Environmental Microbiology ◽

10.1128/aem.01521-19 ◽

2019 ◽

Vol 86 (3) ◽

Cited By ~ 7

Author(s):

Blanca B. Landa ◽

Andreina I. Castillo ◽

Annalisa Giampetruzzi ◽

Alexandra Kahn ◽

Miguel Román-Écija ◽

...

Keyword(s):

Genetic Diversity ◽

Genome Sequence ◽

Sequence Data ◽

Xylella Fastidiosa ◽

Whole Genome Sequence ◽

Whole Genome ◽

Content Type ◽

Genome Sequence Data ◽

Multiple Introductions ◽

Limited Genetic Diversity

ABSTRACT Pathogen introductions have led to numerous disease outbreaks in naive regions of the globe. The plant pathogen Xylella fastidiosa has been associated with various recent epidemics in Europe affecting agricultural crops, such as almond, grapevine, and olive, but also endemic species occurring in natural forest landscapes and ornamental plants. We compared whole-genome sequences of X. fastidiosa subspecies multiplex from America and strains associated with recent outbreaks in southern Europe to infer their likely origins and paths of introduction within and between the two continents. Phylogenetic analyses indicated multiple introductions of X. fastidiosa subspecies multiplex into Italy, Spain, and France, most of which emerged from a clade with limited genetic diversity with a likely origin in California, USA. The limited genetic diversity observed in X. fastidiosa subspecies multiplex strains originating from California is likely due to the clade itself being an introduction from X. fastidiosa subspecies multiplex populations in the southeastern United States, where this subspecies is most likely endemic. Despite the genetic diversity found in some areas in Europe, there was no clear evidence of recombination occurring among introduced X. fastidiosa strains in Europe. Sequence type taxonomy, based on multilocus sequence typing (MLST), was shown, at least in one case, to not lead to monophyletic clades of this pathogen; whole-genome sequence data were more informative in resolving the history of introductions than MLST data. Although additional data are necessary to carefully tease out the paths of these recent dispersal events, our results indicate that whole-genome sequence data should be considered when developing management strategies for X. fastidiosa outbreaks. 
IMPORTANCE Xylella fastidiosa is an economically important plant-pathogenic bacterium that has emerged as a pathogen of global importance associated with a devastating epidemic in olive trees in Italy associated with X. fastidiosa subspecies pauca and other outbreaks in Europe, such as X. fastidiosa subspecies fastidiosa and X. fastidiosa subspecies multiplex in Spain and X. fastidiosa subspecies multiplex in France. We present evidence of multiple introductions of X. fastidiosa subspecies multiplex, likely from the United States, into Spain, Italy, and France. These introductions illustrate the risks associated with the commercial trade of plant material at global scales and the need to develop effective policy to limit the likelihood of pathogen pollution into naive regions. Our study demonstrates the need to utilize whole-genome sequence data to study X. fastidiosa introductions at outbreak stages, since a limited number of genetic markers does not provide sufficient phylogenetic resolution to determine dispersal paths or relationships among strains that are of biological and quarantine relevance.

Whole genome sequence accuracy is improved by replication in a population of mutagenized sorghum

10.1101/095513 ◽

2016 ◽

Author(s):

Charles Addo-Quaye ◽

Mitch Tuinstra ◽

Nicola Carraro ◽

Clifford Weil ◽

Brian P. Dilkes

Keyword(s):

False Positive ◽

Reverse Genetics ◽

Variant Calling ◽

Whole Genome Sequence ◽

Chemical Mutagenesis ◽

Next Generation Sequencing Data ◽

Missense Mutations ◽

Sequencing Data ◽

Induced Mutations ◽

Sequence Coverage

ABSTRACT The accurate detection of induced mutations is critical for both forward and reverse genetics studies. Experimental chemical mutagenesis induces relatively few single base changes per individual. In a complex eukaryotic genome, false positive detection of mutations can occur at or above this mutagenesis rate. We demonstrate here, using a population of ethyl methanesulfonate (EMS) treated Sorghum bicolor BTx623 individuals, that using replication to detect false positive induced variants in next-generation sequencing data permits higher throughput variant detection with greater accuracy. We used a lower sequence coverage depth (average of 7X) from 586 independently mutagenized individuals and detected 5,399,493 homozygous SNPs. Of these, 76% originated from only 57,872 genomic positions prone to false positive variant calling. These positions are characterized by high copy number paralogs where the error-prone SNP positions are at copies containing a variant at the SNP position. The ability of short stretches of homology to generate these error-prone positions suggests that incompletely assembled or poorly mapped repeated sequences are one driver of these error-prone positions. Removal of these false positives left 1,275,872 homozygous and 477,531 heterozygous EMS-induced SNPs which, congruent with the mutagenic mechanism of EMS, were greater than 98% G:C to A:T transitions. Through this analysis we generated a database of sequence indexed mutants of Sorghum. This collection contains 4,035 high impact homozygous mutations in 3,637 genes and 56,514 homozygous missense mutations in 23,227 genes. Each line contains, on average, 2,177 annotated homozygous SNPs per genome, including seven likely gene knockouts and 96 missense mutations. The number of mutations in a transcript was linearly correlated with the transcript length and also the G+C count, but not with the GC/AT ratio.
Analysis of the detected mutagenized positions identified CG-rich patches, and flanking sequences strongly influenced EMS-induced mutation rates. Our method for detecting false-positive induced mutations is generally applicable to any organism, is independent of the choice of in silico variant-calling algorithm, and is most valuable when the true mutation rate is likely to be low, such as in laboratory induced mutations or somatic mutation detection in medicine.
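The replication idea described above can be sketched in a few lines of Python: because independent mutagenesis events rarely hit the same base, a position called as variant in several independently mutagenized individuals is more plausibly a systematic artefact than a real induced mutation. The sketch is not the authors' implementation; the function names, the `(chrom, pos)` encoding and the `min_individuals` threshold are assumptions chosen for illustration.

```python
from collections import Counter


def recurrent_positions(calls_by_individual, min_individuals=2):
    """Return positions called in at least `min_individuals` individuals.

    calls_by_individual: dict mapping an individual ID to an iterable of
    (chrom, pos) tuples (hypothetical encoding).
    min_individuals: recurrence threshold (an assumption, not from the paper).
    """
    counts = Counter()
    for calls in calls_by_individual.values():
        counts.update(set(calls))  # count each position once per individual
    return {pos for pos, n in counts.items() if n >= min_individuals}


def filter_calls(calls_by_individual, min_individuals=2):
    """Drop calls at recurrent positions from every individual's call list."""
    flagged = recurrent_positions(calls_by_individual, min_individuals)
    return {
        ind: [c for c in calls if c not in flagged]
        for ind, calls in calls_by_individual.items()
    }


if __name__ == "__main__":
    calls = {
        "m1": [("1", 100), ("2", 50)],
        "m2": [("1", 100), ("3", 7)],
        "m3": [("1", 100), ("2", 50), ("4", 9)],
    }
    print(filter_calls(calls, min_individuals=3))
```

A suitable threshold depends on population size and the expected mutation density, so it would need to be tuned rather than taken from this example.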

Faculty Opinions recommendation of Optimal algorithms for haplotype assembly from whole-genome sequence data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.13339986.14707085 ◽

2011 ◽

Author(s):

Alejandro Schaffer

Keyword(s):

Genome Sequence ◽

Sequence Data ◽

Whole Genome Sequence ◽

Whole Genome ◽

Optimal Algorithms ◽

Genome Sequence Data ◽

Haplotype Assembly

TIGER: inferring DNA replication timing from whole-genome sequence data

Bioinformatics ◽

10.1093/bioinformatics/btab166 ◽

2021 ◽

Cited By ~ 1

Author(s):

Amnon Koren ◽

Dashiell J Massey ◽

Alexa N Bracci

Keyword(s):

Dna Replication ◽

Genome Sequence ◽

Genomic Dna ◽

Sequence Data ◽

Replication Timing ◽

Whole Genome Sequence ◽

Supplementary Information ◽

Whole Genome ◽

Genome Sequence Data ◽

Dna Replication Timing

Abstract Motivation: Genomic DNA replicates according to a reproducible spatiotemporal program, with some loci replicating early in S phase while others replicate late. Despite being a central cellular process, DNA replication timing studies have been limited in scale due to technical challenges. Results: We present TIGER (Timing Inferred from Genome Replication), a computational approach for extracting DNA replication timing information from whole genome sequence data obtained from proliferating cell samples. The presence of replicating cells in a biological specimen leads to non-uniform representation of genomic DNA that depends on the timing of replication of different genomic loci. Replication dynamics can hence be observed in genome sequence data by analyzing DNA copy number along chromosomes while accounting for other sources of sequence coverage variation. TIGER is applicable to any species with a contiguous genome assembly and rivals the quality of experimental measurements of DNA replication timing. It provides a straightforward approach for measuring replication timing and can readily be applied at scale. Availability and Implementation: TIGER is available at https://github.com/TheKorenLab/TIGER. Supplementary information: Supplementary data are available at Bioinformatics online.

Whole genome resequencing and custom genotyping unveil clonal lineages in ‘Malbec’ grapevines (Vitis vinifera L.)

Scientific Reports ◽

10.1038/s41598-021-87445-y ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Luciano Calderón ◽

Nuria Mauri ◽

Claudio Muñoz ◽

Pablo Carbonell-Bejerano ◽

Laura Bree ◽

...

Keyword(s):

Genetic Diversity ◽

Somatic Mutations ◽

Clonal Propagation ◽

Variant Calling ◽

Vitis Vinifera L ◽

Whole Genome ◽

Single Nucleotide Variants ◽

Genome Resequencing ◽

Diversity Pattern ◽

Whole Genome Resequencing

Abstract Grapevine cultivars are clonally propagated to preserve their varietal attributes. However, genetic variations accumulate due to the occurrence of somatic mutations. This process is influenced by human activity through plant transportation, clonal propagation and selection. Malbec is a cultivar that is highly regarded for red wine production. It originated in Southwestern France and was introduced to Argentina during the 1850s. In order to study the clonal genetic diversity of Malbec grapevines, we generated whole-genome resequencing data for four accessions with different clonal propagation records. A stringent variant calling procedure was established to identify reliable polymorphisms among the analyzed accessions. This procedure retrieved 941 single nucleotide variants (SNVs). A reduced set of the detected SNVs was corroborated through Sanger sequencing, and employed to custom-design a genotyping experiment. We successfully genotyped 214 Malbec accessions using 41 SNVs, and identified 14 genotypes that clustered in two genetically divergent clonal lineages. These lineages were associated with the time span of clonal propagation of the analyzed accessions in Argentina and Europe. Our results show the usefulness of this approach for the study of the scarce intra-cultivar genetic diversity in grapevines. We also provide evidence on how human actions might have driven the accumulation of different somatic mutations, ultimately shaping the Malbec genetic diversity pattern.

Whole genome sequence data of Bacillus australimaris strain B28A, isolated from Marine Water in India

Data in Brief ◽

10.1016/j.dib.2021.107240 ◽

2021 ◽

pp. 107240

Author(s):

Wael Ali Mohammed Hadi ◽

Boby T Edwin ◽

A Jayakumaran Nair

Keyword(s):

Genome Sequence ◽

Sequence Data ◽

Marine Water ◽

Whole Genome Sequence ◽

Whole Genome ◽

Genome Sequence Data
