Family-Based Haplotype Estimation and Allele Dosage Correction for Polyploids Using Short Sequence Reads

Mapping Intimacies ◽

10.1101/318196 ◽

2018 ◽

Cited By ~ 1

Author(s):

Ehsan Motazedi ◽

Richard Finkers ◽

Chris Maliepaard ◽

Dick de Ridder

Keyword(s):

Sequence Data ◽

Pedigree Information ◽

Short Sequence ◽

Haplotype Estimation ◽

Illumina Hiseq ◽

Single Chromosome ◽

Sequencing Errors ◽

Allele Dosage ◽

Missing Genotypes ◽

Family Based

Abstract: DNA sequence reads contain information about the genomic variants located on a single chromosome. By extracting and extending this information (using the overlaps of the reads), the haplotypes of an individual can be obtained. Adding parent-offspring relationships to the read information in a population can considerably improve the quality of the haplotypes obtained from short reads, as pedigree information can compensate for spurious overlaps (due to sequencing errors) and insufficient overlaps (due to shallow coverage). This improvement is especially beneficial for polyploid organisms, which have more than two copies of each chromosome and are therefore more difficult to haplotype than diploids. We develop a novel method, PopPoly, to estimate polyploid haplotypes in an F1-population from short sequence data by considering the transmission of the haplotypes from the parents to the offspring. In addition, PopPoly employs this information to improve genotype dosage estimation and to call missing genotypes in the population. Through realistic simulations, we compare PopPoly to other haplotyping methods and show its better performance in terms of phasing accuracy and the accuracy of phased genotypes. We apply PopPoly to estimate the parental and offspring haplotypes for a tetraploid potato cross with 10 offspring, using Illumina HiSeq sequence data of 9 genomic regions involved in plant maturity and tuberisation.
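
The transmission constraint mentioned in the abstract can be illustrated with a small sketch. This is not PopPoly's algorithm. It is a toy example that assumes a tetraploid parent passes on any two of its four homologues (no recombination, no double reduction) and lists the allele dosages an F1 offspring could then carry at one SNP. The function names and the example haplotypes are hypothetical.

```python
from itertools import combinations, product

def gametes(parent_haplotypes):
    """All pairs of homologues a tetraploid parent could transmit.

    Assumption: no recombination and no double reduction, so a gamete
    is simply two distinct homologues out of the parent's four.
    """
    return list(combinations(parent_haplotypes, 2))

def offspring_dosages(mother, father, snp_index):
    """Set of alternative-allele dosages possible in an F1 offspring at one SNP."""
    dosages = set()
    for gm, gf in product(gametes(mother), gametes(father)):
        # gm + gf is the tuple of the four homologues inherited by the offspring
        dosages.add(sum(h[snp_index] for h in gm + gf))
    return dosages

# Hypothetical parental haplotypes over three SNPs (0 = reference, 1 = alternative)
mother = [(0, 1, 0), (0, 1, 0), (0, 0, 0), (1, 1, 0)]
father = [(0, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 0)]

print(offspring_dosages(mother, father, 1))
```

Under these assumptions, an offspring dosage call that falls outside the returned set is inconsistent with the parental haplotypes, which is the kind of signal a family-based method could use to flag or correct a genotype.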

Family-Based Haplotype Estimation and Allele Dosage Correction for Polyploids Using Short Sequence Reads

Frontiers in Genetics ◽

10.3389/fgene.2019.00335 ◽

2019 ◽

Vol 10 ◽

Cited By ~ 1

Author(s):

Ehsan Motazedi ◽

Chris Maliepaard ◽

Richard Finkers ◽

Richard Visser ◽

Dick de Ridder

Keyword(s):

Short Sequence ◽

Haplotype Estimation ◽

Allele Dosage ◽

Family Based

SimRVSequences: an R package to simulate genetic sequence data for pedigrees

10.1101/534552 ◽

2019 ◽

Author(s):

Christina Nieuwoudt ◽

Angela Brooks-Wilson ◽

Jinko Graham

Keyword(s):

Rare Variants ◽

Sequence Data ◽

Simulated Data ◽

R Package ◽

Single Nucleotide Variant ◽

Ease Of Use ◽

Case Control Studies ◽

Single Nucleotide ◽

Sequencing Errors ◽

Family Based

Abstract
Summary: Family-based studies have several advantages over case-control studies for finding causal rare variants for a disease; these include increased power, smaller sample size requirements, and improved detection of sequencing errors. However, collecting suitable families and compiling their data is time-consuming and expensive. To evaluate methodology to identify causal rare variants in family-based studies, one can use simulated data. For this purpose we present the R package SimRVSequences. Users supply a sample of pedigrees and single-nucleotide variant data from a sample of unrelated individuals representing the pedigree founders. Users may also model genetic heterogeneity among families. For ease of use, SimRVSequences offers methods to import and format single-nucleotide variant data and pedigrees from existing software.
Availability and Implementation: SimRVSequences is available as a library for R ≥ 3.5.0 on the Comprehensive R Archive Network.

Stability of SARS-CoV-2 Phylogenies

10.1101/2020.06.08.141127 ◽

2020 ◽

Cited By ~ 3

Author(s):

Yatish Turakhia ◽

Bryan Thornlow ◽

Landen Gozashti ◽

Angie S. Hinrichs ◽

Jason D. Fernandes ◽

...

Keyword(s):

Binding Sites ◽

Sequence Data ◽

Scientific Discovery ◽

Lineage Tracing ◽

Protein Coding ◽

Sequencing Errors ◽

Scientific Inference ◽

Recurrent Mutations ◽

Sequence Quality ◽

Essential Sequence

Abstract: The SARS-CoV-2 pandemic has led to unprecedented, nearly real-time genetic tracing due to the rapid community sequencing response. Researchers immediately leveraged these data to infer the evolutionary relationships among viral samples and to study key biological questions, including whether host viral genome editing and recombination are features of SARS-CoV-2 evolution. This global sequencing effort is inherently decentralized and must rely on data collected by many labs using a wide variety of molecular and bioinformatic techniques. There is thus a strong possibility that systematic errors associated with lab-specific practices affect some sequences in the repositories. We find that some recurrent mutations in reported SARS-CoV-2 genome sequences have been observed predominantly or exclusively by single labs, co-localize with commonly used primer binding sites and are more likely to affect the protein coding sequences than other similarly recurrent mutations. We show that their inclusion can affect phylogenetic inference on scales relevant to local lineage tracing, and make it appear as though there has been an excess of recurrent mutation and/or recombination among viral lineages. We suggest how samples can be screened and problematic mutations removed. We also develop tools for comparing and visualizing differences among phylogenies and we show that consistent clade- and tree-based comparisons can be made between phylogenies produced by different groups. These will facilitate evolutionary inferences and comparisons among phylogenies produced for a wide array of purposes.
Building on the SARS-CoV-2 Genome Browser at UCSC, we present a toolkit to compare, analyze and combine SARS-CoV-2 phylogenies, find and remove potential sequencing errors and establish a widely shared, stable clade structure for a more accurate scientific inference and discourse.
Foreword: We wish to thank all groups that responded rapidly by producing these invaluable and essential sequence data. Their contributions have enabled an unprecedented, lightning-fast process of scientific discovery, truly an incredible benefit for humanity and for the scientific community. We emphasize that most lab groups with whom we associate specific suspicious alleles are also those who have produced the most sequence data at a time when it was urgently needed. We commend their efforts. We have already contacted each group and many have updated their sequences. Our goal with this work is not to highlight potential errors, but to understand the impacts of these and other kinds of highly recurrent mutations so as to identify commonalities among the suspicious examples that can improve sequence quality and analysis going forward.

The CMT2D Locus: Refined Genetic Position and Construction of a Bacterial Clone-Based Physical Map

Genome Research ◽

10.1101/gr.9.6.568 ◽

1999 ◽

Vol 9 (6) ◽

pp. 568-574 ◽

Cited By ~ 1

Author(s):

Rachel E. Ellsworth ◽

Victor Ionasescu ◽

Charles Searby ◽

Val C. Sheffield ◽

Valerie V. Braden ◽

...

Keyword(s):

Linkage Analysis ◽

Yeast Artificial Chromosome ◽

Sequence Data ◽

Physical Map ◽

Muscular Atrophy ◽

Large Family ◽

Artificial Chromosome ◽

Single Chromosome ◽

Mixed Features ◽

Bacterial Clone

Charcot-Marie-Tooth (CMT) disease is a progressive neuropathy of the peripheral nervous system, typically characterized by muscle weakness of the distal limbs. CMT is noted for its genetic heterogeneity, with four distinct loci already identified for the axonal form of the disease (CMT2). In 1996, linkage analysis of a single large family revealed the presence of a CMT2 locus on chromosome 7p14 (designated CMT2D). Additional families have been linked subsequently to the same genomic region, including one with distal spinal muscular atrophy (dSMA) and one with mixed features of dSMA and CMT2; symptoms in both of these latter families closely resemble those seen in the original CMT2D family. There is thus a distinct possibility that CMT2 and dSMA encountered in these families reflect allelic heterogeneity at a single chromosome 7 locus. In the study reported here, we have performed more detailed linkage analysis of the original CMT2D family based on new knowledge of the physical locations of various genetic markers. The region containing the CMT2D gene, as defined by the original family, overlaps with those defined by at least two other families with CMT2 and/or dSMA symptoms. Both yeast artificial chromosome (YAC) and bacterial clone-based [bacterial artificial chromosome (BAC) and P1-derived artificial chromosome (PAC)] contig maps spanning ∼3.4 Mb have been assembled across the combined CMT2D critical region, with the latter providing suitable clones for systematic sequencing of the interval. Preliminary analyses have already revealed at least 28 candidate genes and expressed-sequence tags (ESTs). The mapping information reported here in conjunction with the evolving sequence data should expedite the identification of the CMT2D/dSMA gene or genes.

Decona: From demultiplexing to consensus for Nanopore amplicon data

ARPHA Conference Abstracts ◽

10.3897/aca.4.e65029 ◽

2021 ◽

Vol 4 ◽

Author(s):

Saskia Oosterbroek ◽

Karlijn Doorenspleet ◽

Reindert Nijland ◽

Lara Jansen

Keyword(s):

Sequence Data ◽

Variant Calling ◽

Environmental Dna ◽

Laptop Computer ◽

Consensus Sequences ◽

Sequencing Errors ◽

Blast Output ◽

Command Line Tool ◽

Microbial Symbionts ◽

User Friendly

Sequencing of long amplicons is one of the major benefits of Nanopore technologies, as it allows for reads much longer than those from Illumina platforms. One of the major challenges for the analysis of these long Nanopore reads is the relatively high error rate. Sequencing errors are generally corrected by consensus generation and polishing. This is still a challenge for mixed samples such as metabarcoding environmental DNA, bulk DNA, mixed amplicon PCRs and contaminated samples, because sequence data would have to be clustered before consensus generation. To this end, we developed Decona (https://github.com/Saskia-Oosterbroek/decona), a command line tool that creates consensus sequences from mixed (metabarcoding) samples using a single command. Decona uses the CD-HIT algorithm to cluster reads after demultiplexing (qcat) and filtering (NanoFilt). The sequences in each cluster are subsequently aligned (Minimap2), consensus sequences are generated (Racon) and finally polished (Medaka). Variant calling of the clusters (Medaka) is optional. With the integration of the BLAST+ application, Decona not only generates consensus sequences but also produces BLAST output if desired. The program can be used on a laptop computer, making it suitable for use under field conditions. Amplicon data ranging from 300 to 7500 nucleotides were successfully processed by Decona, creating consensus sequences reaching over 99.9% read identity. This included fish datasets (environmental DNA from filtered water) from a curated aquarium, vertebrate datasets that were contaminated with human sequences, and the separation of sponge sequences from their countless microbial symbionts. Decona considerably simplifies and speeds up post-sequencing processes, providing consensus sequences and BLAST output through a single command. Classifying consensus sequences instead of raw sequences improves classification accuracy and drastically decreases the number of sequences that need to be classified.
Overall, it is a user-friendly option for researchers with limited knowledge of script-based data processing.
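
The idea that a consensus over many noisy reads cancels out random sequencing errors can be sketched in a few lines. This is a toy stand-in, not what Decona, Racon or Medaka do internally: it assumes the reads of one cluster are already aligned to equal length and takes a simple per-column majority vote. The function name and the example reads are hypothetical.

```python
from collections import Counter

def majority_consensus(aligned_reads):
    """Per-column majority vote over equal-length, pre-aligned reads.

    Toy illustration only: real polishing tools work on alignments with
    gaps and use quality-aware models rather than a plain vote.
    """
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

# Five hypothetical reads from one cluster; two carry a single error each
reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTTC", "ACGTAC"]
print(majority_consensus(reads))
```

Because each error occurs in only one read per column, the vote recovers the underlying sequence. This is also why reads from different templates need to be clustered first: a vote across a mixed cluster would blend distinct sequences.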

Genome resequencing data for Iranian local dogs and wolves

BMC Research Notes ◽

10.1186/s13104-020-05271-3 ◽

2020 ◽

Vol 13 (1) ◽

Author(s):

Zeinab Amiri Ghanatsaman ◽

Guo-Dong Wang ◽

Masood Asadi Fozi ◽

Ya-Ping Zhang ◽

Ali Esmailizadeh

Keyword(s):

Evolutionary Biology ◽

Sequence Data ◽

Genome Resequencing ◽

Illumina Hiseq ◽

Blood Samples ◽

Animal Domestication ◽

Hunting Dogs ◽

Physical Traits ◽

Whole Genome Resequencing ◽

Dog Domestication

Abstract
Objective: The data provided herein represent the whole-genome resequencing data related to three wolves and three Iranian local dogs. The understanding of genome evolution during animal domestication is an interesting subject in genome biology. The dog is an excellent model for understanding domestication due to its considerable variety of behavioral and physical traits. The Zagros area of present-day Iran has been identified as one of the initial centers of animal domestication. The availability of the complete genome sequences of Iranian local canids can be a valuable resource for researchers to address questions and test hypotheses on the dog domestication process.
Data description: We collected blood samples from six Iranian local canids including two hunting dogs (Saluki breed), a mastiff dog (Qahderijani ecotype) and three wolves. We extracted genomic DNA from blood samples. Sequence data were produced using the Illumina HiSeq 2500 system. All sequence data are available in the National Genomics Data Center (NGDC), Genome Sequence Archive (GSA) database under the accession of CRA001324 and the National Center for Biotechnology Information (NCBI) under the accession of PRJNA639312. The short-read sequences with the mean depth of 16X were aligned to the dog reference genome (CanFam3.1) and achieved 99% coverage of the reference assembly. The obtained information from this experiment will be useful in evolutionary biology.

Drosophila genomes and the development of affordable molecular markers for species genotyping

Genome ◽

10.1139/g10-120 ◽

2011 ◽

Vol 54 (4) ◽

pp. 341-347 ◽

Cited By ~ 1

Author(s):

Leigh Minuk ◽

Alberto Civetta

Keyword(s):

Molecular Markers ◽

Genome Sequence ◽

Sequence Data ◽

Closely Related Species ◽

Genome Sequence Data ◽

Sequencing Errors ◽

Genome Data ◽

Indel Polymorphisms ◽

Partial Genome ◽

Rule Out

The recent completion of genome sequencing of 12 species of Drosophila has provided a powerful resource for hypothesis testing, as well as the development of technical tools. Here we take advantage of genome sequence data from two closely related species of Drosophila, Drosophila simulans and Drosophila sechellia, to quickly identify candidate molecular markers for genotyping based on expected insertion or deletion (indel) differences between species. Out of 64 candidate molecular markers selected along the second and third chromosome of Drosophila, 51 molecular markers were validated using PCR and gel electrophoresis. We found that the 20% error rate was due to sequencing errors in the genome data, although we cannot rule out possible indel polymorphisms. The approach has the advantage of being affordable and quick, as it only requires the use of bioinformatics tools for predictions and a PCR and agarose gel based assay for validation. Moreover, the approach could be easily extended to a wide variety of taxa with the only limitation being the availability of complete or partial genome sequence data.
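
The 20% figure quoted in the abstract follows from the two marker counts it gives. A minimal check of that arithmetic (the variable names are illustrative only):

```python
candidates = 64   # candidate markers selected
validated = 51    # markers confirmed by PCR and gel electrophoresis

failed = candidates - validated
failure_rate = failed / candidates

print(f"{failed} of {candidates} markers failed ({failure_rate:.1%})")
# prints "13 of 64 markers failed (20.3%)"
```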

A comparative analysis of family-based and population-based association tests using whole genome sequence data

BMC Proceedings ◽

10.1186/1753-6561-8-s1-s33 ◽

2014 ◽

Vol 8 (Suppl 1) ◽

pp. S33 ◽

Cited By ~ 6

Author(s):

Jin J Zhou ◽

Wai-Ki Yip ◽

Michael H Cho ◽

Dandi Qiao ◽

Merry-Lynn N McDonald ◽

...

Keyword(s):

Comparative Analysis ◽

Genome Sequence ◽

Sequence Data ◽

Population Based ◽

Whole Genome Sequence ◽

Whole Genome ◽

Association Tests ◽

Genome Sequence Data ◽

Family Based

Improved metagenome assemblies and taxonomic binning using long-read circular consensus sequence data

10.1101/026922 ◽

2015 ◽

Cited By ~ 2

Author(s):

Jeremy A. Frank ◽

Yao Pan ◽

Ave Tooming-Klunderud ◽

Vincent G.H. Eijsink ◽

Alice C. McHardy ◽

...

Keyword(s):

Sequence Data ◽

Consensus Sequence ◽

Dna Assembly ◽

Illumina Hiseq ◽

Average Contig Length ◽

Long Read ◽

And Performance ◽

And Function ◽

Taxonomic Binning ◽

Large Contigs

DNA assembly is a core methodological step in metagenomic pipelines used to study the structure and function within microbial communities. Here we investigate the utility of Pacific Biosciences long, high-accuracy circular consensus sequencing (CCS) reads for metagenomics projects. We compared the application and performance of both PacBio CCS and Illumina HiSeq data with assembly and taxonomic binning algorithms using metagenomic samples representing a complex microbial community. Eight SMRT cells produced approximately 94 Mb of CCS reads from a biogas reactor microbiome sample, which averaged 1319 nt in length and 99.7% accuracy. CCS data assembly generated a number of large contigs (greater than 1 kb) comparable to that assembled from a ~190x larger HiSeq dataset (~18 Gb) produced from the same sample (i.e., approximately 62% of total contigs). Hybrid assemblies using PacBio CCS and HiSeq contigs produced improvements in assembly statistics, including an increase in the average contig length and number of large contigs. The incorporation of CCS data produced significant enhancements in taxonomic binning and genome reconstruction of two dominant phylotypes, which assembled and binned poorly using HiSeq data alone. Collectively these results illustrate the value of PacBio CCS reads in certain metagenomics applications.
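
The "~190x larger" comparison can be checked from the two dataset sizes quoted in the abstract. The sketch below assumes Mb and Gb denote 10^6 and 10^9 bases; both inputs are the abstract's own approximations, so the result is only approximate as well.

```python
ccs_bases = 94e6     # ~94 Mb of PacBio CCS reads
hiseq_bases = 18e9   # ~18 Gb of Illumina HiSeq reads

ratio = hiseq_bases / ccs_bases
print(f"HiSeq dataset is about {ratio:.0f}x larger")
```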

Agro-morphological, yield, and genotyping-by-sequencing data of selected wheat germplasm

10.1101/2020.07.18.209882 ◽

2020 ◽

Author(s):

Madiha Islam ◽

Abdullah ◽

Bibi Zubaida ◽

Nosheen Shafqat ◽

Rabia Masood ◽

...

Keyword(s):

Triticum Aestivum ◽

Sequence Data ◽

Genotyping By Sequencing ◽

Wheat Breeding ◽

Sequencing Data ◽

Illumina Hiseq ◽

Yield Data ◽

Single Nucleotide ◽

Breeding Programs ◽

Short Reads

Abstract: Wheat (Triticum aestivum) is the most important staple food in Pakistan. Knowledge of its genetic diversity is critical for designing effective crop breeding programs. Here we report agro-morphological and yield data for 112 genotypes (including 7 duplicates) of wheat (Triticum aestivum) cultivars, advanced lines, landraces and wild relatives, collected from several research institutes and breeders across Pakistan. We also report genotyping-by-sequencing (GBS) data for a selected sub-set of 52 genotypes. Sequencing was performed on the Illumina HiSeq 2500 platform using the PE150 run. Data generated per sample ranged from 1.01 to 2.5 Gb; 90% of the short reads exhibited quality scores above 99.9%. The TGACv1 wheat genome was used as a reference to map short reads from individual genotypes and to filter single nucleotide polymorphic loci (SNPs). On average, 364,074 ± 54,479 SNPs per genotype were recorded. The sequencing data have been submitted to the SRA database of NCBI (accession number SRP179096). The agro-morphological and yield data, along with the sequence data and SNPs, will be invaluable resources for wheat breeding programs in the future.