Hecaton: reliably detecting copy number variation in plant genomes using short read sequencing data

Background: Copy number variation (CNV) is thought to actively contribute to adaptive evolution of plant species. While many computational algorithms are available to detect copy number variation from whole genome sequencing datasets, the typical complexity of plant data likely introduces false positive calls. Results: To enable reliable and comprehensive detection of CNV in plant genomes, we developed Hecaton, a novel computational workflow tailored to plants that integrates calls from multiple state-of-the-art algorithms through a machine-learning approach. In this paper, we demonstrate that Hecaton outperforms current methods when applied to short read sequencing data of Arabidopsis thaliana, rice, maize, and tomato. Moreover, it correctly detects dispersed duplications, a type of CNV commonly found in plant species, in contrast to several state-of-the-art tools that erroneously represent this type of CNV as overlapping deletions and tandem duplications. Finally, Hecaton scales well in terms of memory usage and running time when applied to short read datasets of domesticated and wild tomato accessions. Conclusions: Hecaton provides a robust method to detect CNV in plants. We expect it to be of immediate interest to both applied and fundamental research on the relationship between genotype and phenotype in plants.

10.1101/720805 ◽

2019 ◽

Author(s):

Raúl Wijfjes ◽

Sandra Smit ◽

Dick de Ridder

Keyword(s):

Copy Number Variation ◽

Plant Species ◽

Copy Number ◽

State Of The Art ◽

Sequencing Data ◽

Short Read ◽

Plant Genomes ◽

Short Read Sequencing ◽

Number Variation ◽

Multiple State

Extensive gene duplication in Arabidopsis revealed by pseudo-heterozygosity

10.1101/2021.11.15.468652 ◽

2021 ◽

Author(s):

Benjamin Jaegle ◽

Luz Mayela Soto-Jimenez ◽

Robin Burns ◽

Fernando A. Rabanal ◽

Magnus Nordborg

Keyword(s):

Copy Number ◽

Structural Variation ◽

De Novo ◽

Sequencing Data ◽

Heterozygous Snps ◽

Mendelian Segregation ◽

Short Read ◽

Short Read Sequencing ◽

Snp Data ◽

Number Variation

Background: It is becoming apparent that genomes harbor massive amounts of structural variation, and that this variation has largely gone undetected for technical reasons. In addition to being inherently interesting, structural variation can cause artifacts when short-read sequencing data are mapped to a reference genome. In particular, spurious SNPs (that do not show Mendelian segregation) may result from mapping of reads to duplicated regions. Recalling SNPs using the raw reads of the 1001 Arabidopsis Genomes Project, we identified 3.3 million heterozygous SNPs (44% of total). Given that Arabidopsis thaliana (A. thaliana) is highly selfing, we hypothesized that these SNPs reflected cryptic copy number variation, and investigated them further. Results: While genuine heterozygosity should occur in tracts within individuals, heterozygosity at a particular locus is instead shared across individuals in a manner that strongly suggests it reflects segregating duplications rather than actual heterozygosity. Focusing on pseudo-heterozygosity in annotated genes, we used GWAS to map the position of the duplicates, identifying 2500 putatively duplicated genes. The results were validated using de novo genome assemblies from six lines. Specific examples included an annotated gene and nearby transposon that, in fact, transpose together. Conclusions: Our study confirms that most heterozygous SNP calls in A. thaliana are artifacts, and suggests that great caution is needed when analysing SNP data from short-read sequencing. The finding that 10% of annotated genes are copy-number variable, and the realization that neither gene nor transposon annotation necessarily tells us what is actually mobile in the genome, suggest that future analyses based on independently assembled genomes will be very informative.
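
The screening idea in this abstract (heterozygous calls that recur at the same site across many individuals of a selfing species are suspect) can be illustrated with a minimal sketch. It is not the authors' pipeline, which additionally used GWAS and de novo assemblies. The function names, the unphased `0/1` genotype strings, and the `min_fraction` threshold are all assumptions made for illustration.

```python
def heterozygous_fraction(calls):
    """Fraction of non-missing genotype calls that are heterozygous.

    `calls` is a list of unphased genotype strings such as "0/0", "0/1",
    or "./." (missing). This encoding is an assumption of the sketch.
    """
    called = [g for g in calls if g != "./."]
    if not called:
        return 0.0
    hets = sum(1 for g in called if len(set(g.split("/"))) > 1)
    return hets / len(called)


def flag_pseudo_heterozygous(genotypes, min_fraction=0.05):
    """Return sites whose heterozygous calls are shared by many individuals.

    `genotypes` maps a site, here a (chromosome, position) tuple, to the
    list of calls across individuals. `min_fraction` is a hypothetical
    cut-off, not a value taken from the study.
    """
    return sorted(
        site
        for site, calls in genotypes.items()
        if heterozygous_fraction(calls) >= min_fraction
    )


# Toy input: two sites genotyped in four individuals.
genotypes = {
    ("Chr1", 100): ["0/0", "0/1", "0/1", "1/1"],
    ("Chr1", 200): ["0/0", "0/0", "1/1", "./."],
}
flagged = flag_pseudo_heterozygous(genotypes, min_fraction=0.25)
```

In a highly selfing species, a flagged site is a candidate for a segregating duplication rather than true heterozygosity; confirming that would need independent evidence such as the assemblies the abstract mentions.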

High resolution copy number inference in cancer using short-molecule nanopore sequencing

10.1101/2020.12.28.424602 ◽

2020 ◽

Author(s):

Timour Baslan ◽

Sam Kovaka ◽

Fritz J. Sedlazeck ◽

Yanming Zhang ◽

Robert Wappel ◽

...

Keyword(s):

Copy Number ◽

Cost Effective ◽

Chromosome Analysis ◽

Ease Of Use ◽

Precision Oncology ◽

Nanopore Sequencing ◽

Dna Molecules ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing

Abstract: Genome copy number is an important source of genetic variation in health and disease. In cancer, clinically actionable Copy Number Alterations (CNAs) can be inferred from short-read sequencing data, enabling genomics-based precision oncology. Emerging Nanopore sequencing technologies offer the potential for broader clinical utility, for example in smaller hospitals, due to lower instrument cost, higher portability, and ease of use. Nonetheless, Nanopore sequencing devices are limited in terms of the number of retrievable sequencing reads/molecules compared to short-read sequencing platforms. This represents a challenge for applications that require high read counts, such as CNA inference. To address this limitation, we targeted the sequencing of short-length DNA molecules loaded at optimized concentration in an effort to increase sequence read/molecule yield from a single nanopore run. We show that sequencing short DNA molecules reproducibly returns high read counts and allows high quality CNA inference. We demonstrate the clinical relevance of this approach by accurately inferring CNAs in acute myeloid leukemia samples. The data show that, compared to traditional approaches such as chromosome analysis/cytogenetics, short molecule nanopore sequencing returns more sensitive, accurate copy number information in a cost-effective and expeditious manner, including for multiplex samples. Our results provide a framework for the sequencing of relatively short DNA molecules on nanopore devices with applications in research and medicine that include, but are not limited to, CNA inference.
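
Why CNA inference depends on read counts can be seen in a generic read-depth sketch: reads are counted in fixed-size bins and each bin is compared with the sample median. This is a common textbook approach, not necessarily the method used in the study. The single-chromosome coordinates, the pseudocount, and the `gain`/`loss` thresholds are assumptions chosen for illustration.

```python
import math
from collections import Counter
from statistics import median


def bin_counts(read_starts, bin_size, n_bins):
    """Count reads per fixed-size bin from their start positions."""
    counts = Counter(pos // bin_size for pos in read_starts)
    return [counts.get(i, 0) for i in range(n_bins)]


def log2_ratios(counts, pseudocount=0.5):
    """Log2 ratio of each bin count to the median bin count."""
    ref = median(counts)
    if ref <= 0:
        raise ValueError("median bin count must be positive")
    return [math.log2((c + pseudocount) / (ref + pseudocount)) for c in counts]


def call_bins(ratios, gain=0.4, loss=-0.6):
    """Label each bin using hypothetical log2-ratio thresholds."""
    return [
        "gain" if r >= gain else "loss" if r <= loss else "neutral"
        for r in ratios
    ]


# Toy input: five 1 kb bins with 100, 100, 150, 50, and 100 reads.
starts = []
for i, n in enumerate([100, 100, 150, 50, 100]):
    starts.extend(i * 1000 + j for j in range(n))

counts = bin_counts(starts, bin_size=1000, n_bins=5)
calls = call_bins(log2_ratios(counts))
```

With few reads per bin, sampling noise in the counts becomes comparable to the shift caused by a single-copy gain or loss, which is why a higher read yield per run matters for this kind of analysis.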

Insights into dispersed duplications and complex structural mutations from whole genome sequencing 706 families

10.1101/2020.08.03.235358 ◽

2020 ◽

Author(s):

Christopher W. Whelan ◽

Robert E. Handsaker ◽

Giulio Genovese ◽

Seva Kashin ◽

Monkol Lek ◽

...

Keyword(s):

Gene Expression ◽

Copy Number Variation ◽

Copy Number ◽

De Novo ◽

Whole Genome ◽

Sequencing Data ◽

Number Variation ◽

Structural Mutations ◽

Or Gene ◽

Genomic Locations

Abstract: Two intriguing forms of genome structural variation (SV), dispersed duplications and de novo rearrangements of complex, multi-allelic loci, have long escaped genomic analysis. We describe a new way to find and characterize such variation by utilizing identity-by-descent (IBD) relationships between siblings together with high-precision measurements of segmental copy number. Analyzing whole-genome sequence data from 706 families, we find hundreds of “IBD-discordant” (IBDD) CNVs: loci at which siblings’ CNV measurements and IBD states are mathematically inconsistent. We found that commonly-IBDD CNVs identify dispersed duplications; we mapped 95 of these common dispersed duplications to their true genomic locations through family-based linkage and population linkage disequilibrium (LD), and found several to be in strong LD with genome-wide association (GWAS) signals for common diseases or gene expression variation at their revealed genomic locations. Other CNVs that were IBDD in a single family appear to involve de novo mutations in complex and multi-allelic loci; we identified 26 de novo structural mutations that had not been detected in earlier analyses of the same families by diverse SV analysis methods. These included a de novo mutation of the amylase gene locus and multiple de novo mutations at chromosome 15q14. Combining these complex mutations with more-conventional CNVs, we estimate that segmental mutations larger than 1 kb arise about once per 22 human meioses. These methods are complementary to previous techniques in that they interrogate genomic regions that are home to segmental duplication, high CNV allele frequencies, and multi-allelic CNVs.

Author Summary: Copy number variation is an important form of genetic variation in which individuals differ in the number of copies of segments of their genomes. Certain aspects of copy number variation have traditionally been difficult to study using short-read sequencing data. For example, standard analyses often cannot tell whether the duplicated copies of a segment are located near the original copy or are dispersed to other regions of the genome. Another aspect of copy number variation that has been difficult to study is the detection of mutations in the copy number of DNA segments passed down from parents to their children, particularly when the mutations affect genome segments which already display common copy number variation in the population. We develop an analytical approach to solving these problems when sequencing data is available for all members of families with at least two children. This method is based on determining the number of parental haplotypes the two siblings share at each location in their genome, and using that information to determine the possible inheritance patterns that might explain the copy numbers we observe in each family member. We show that dispersed duplications and mutations can be identified by looking for copy number variants that do not follow these expected inheritance patterns. We use this approach to determine the location of 95 common duplications which are dispersed to distant regions of the genome, and demonstrate that these duplications are linked to genetic variants that affect disease risk or gene expression levels. We also identify a set of copy number mutations not detected by previous analyses of sequencing data from a large cohort of families, and show that repetitive and complex regions of the genome undergo frequent mutations in copy number.
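
The inheritance check described in the summary can be sketched as a brute-force search. Given the diploid copy numbers of two parents and two siblings, plus the number of parental haplotypes the siblings share at the locus, the code asks whether any split of each parent's copies across its two haplotypes explains both children. This is an illustrative reconstruction and not the authors' implementation; it assumes exact integer copy numbers, all copies residing at the locus where IBD was measured, and no de novo mutation. The function name and argument layout are hypothetical.

```python
from itertools import product


def consistent_with_ibd(parent1_cn, parent2_cn, sib1_cn, sib2_cn, ibd_state):
    """Return True if some haplotype assignment explains both siblings.

    ibd_state is the number of parental haplotypes the siblings share at
    the locus (0, 1, or 2). Copy numbers are diploid totals.
    """
    if ibd_state not in (0, 1, 2):
        raise ValueError("ibd_state must be 0, 1, or 2")
    for a in range(parent1_cn + 1):
        hap1 = (a, parent1_cn - a)          # copies on parent 1's haplotypes
        for c in range(parent2_cn + 1):
            hap2 = (c, parent2_cn - c)      # copies on parent 2's haplotypes
            # Sibling 1 inherits hap1[i] and hap2[j]; sibling 2 inherits
            # hap1[m] and hap2[n].
            for i, j, m, n in product((0, 1), repeat=4):
                shared = (i == m) + (j == n)
                if shared != ibd_state:
                    continue
                if (hap1[i] + hap2[j] == sib1_cn
                        and hap1[m] + hap2[n] == sib2_cn):
                    return True
    return False


# Siblings sharing both haplotypes must have equal copy number.
same = consistent_with_ibd(2, 2, 2, 2, ibd_state=2)
discordant = consistent_with_ibd(2, 2, 2, 3, ibd_state=2)
```

Under these assumptions, a locus where the function returns `False` is "IBD-discordant". The abstract attributes such discordance either to duplicated copies that lie elsewhere in the genome, where the siblings' IBD state differs, or to a de novo mutation.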

Rapid Mycobacterium tuberculosis spoligotyping from uncorrected long reads using Galru

10.1101/2020.05.31.126490 ◽

2020 ◽

Author(s):

Andrew J. Page ◽

Nabil-Fareed Alikhan ◽

Michael Strinden ◽

Thanh Le Viet ◽

Timofey Skvortsov

Keyword(s):

Mycobacterium Tuberculosis ◽

State Of The Art ◽

Sequence Data ◽

Human Pathogen ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Long Reads ◽

Long Read

Abstract: Spoligotyping of Mycobacterium tuberculosis provides a subspecies classification of this major human pathogen. Spoligotypes can be predicted from short read genome sequencing data; however, no methods exist for long read sequence data such as from Nanopore or PacBio. We present a novel software package, Galru, which can rapidly detect the spoligotype of a Mycobacterium tuberculosis sample from as little as a single uncorrected long read. It allows for near real-time spoligotyping from long read data as it is being sequenced, giving rapid sample typing. We compare it to the existing state-of-the-art software and find that its results are identical to those obtained from short read sequencing data. Galru is freely available from https://github.com/quadram-institute-bioscience/galru under the GPLv3 open source licence.

A Deep Learning Approach for Detecting Copy Number Variation in Next-Generation Sequencing Data

G3 Genes|Genome|Genetics ◽

10.1534/g3.119.400596 ◽

2019 ◽

Vol 9 (11) ◽

pp. 3575-3582 ◽

Cited By ~ 5

Author(s):

Tom Hill ◽

Robert L. Unckless

Keyword(s):

Deep Learning ◽

Next Generation Sequencing ◽

Copy Number Variation ◽

Copy Number ◽

Next Generation Sequencing Data ◽

Learning Approach ◽

Next Generation ◽

Sequencing Data ◽

Number Variation ◽

Generation Sequencing

Structural genome analysis in cultivated potato taxa

Theoretical and Applied Genetics ◽

10.1007/s00122-019-03519-6 ◽

2019 ◽

Vol 133 (3) ◽

pp. 951-966 ◽

Cited By ~ 3

Author(s):

Maria Kyriakidou ◽

Sai Reddy Achakkagari ◽

José Héctor Gálvez López ◽

Xinyi Zhu ◽

Chen Yu Tang ◽

...

Keyword(s):

Copy Number Variation ◽

Copy Number ◽

Agronomic Traits ◽

Genomic Variation ◽

Sequencing Data ◽

Structural Variations ◽

Structural Genomic ◽

Ploidy Levels ◽

Cultivated Potato ◽

Number Variation

Key message: Twelve potato accessions were selected to represent two principal views on potato taxonomy. The genomes were sequenced and analyzed for structural variation (copy number variation) against three published potato genomes. Abstract: The common potato (Solanum tuberosum L.) is an important staple crop with a highly heterozygous and complex tetraploid genome. The other taxa of cultivated potato contain varying ploidy levels (2X–5X), and structural variations are common in the genomes of these species, likely contributing to the diversification of agronomic traits during domestication. Increased understanding of the genomes and genomic variation will aid in the exploration of novel agronomic traits. Thus, sequencing data from twelve potato landraces, representing the four ploidy levels, were used to identify structural genomic variation compared to the two currently available reference genomes, a double monoploid potato genome and a diploid inbred clone of S. chacoense. A copy number variation analysis showed that, in the majority of the genomes, deletions outnumber duplications, yet duplicated genes outnumber deleted ones. Specific regions in the twelve potato genomes have a high density of CNV events. Further, the auxin-induced SAUR genes (involved in abiotic stress), disease resistance genes and the 2-oxoglutarate/Fe(II)-dependent oxygenase superfamily proteins, among others, had increased copy numbers in these sequenced genomes relative to the references.

Erratum to: CoNVEX: copy number variation estimation in exome sequencing data using HMM

BMC Bioinformatics ◽

10.1186/1471-2105-14-s2-s26 ◽

2013 ◽

Vol 14 (S2) ◽

Cited By ~ 2

Author(s):

Kaushalya C Amarasinghe ◽

Jason Li ◽

Saman K Halgamuge

Keyword(s):

Copy Number Variation ◽

Exome Sequencing ◽

Copy Number ◽

Sequencing Data ◽

Exome Sequencing Data ◽

Number Variation

CovCopCan: An efficient tool to detect Copy Number Variation from amplicon sequencing data in inherited diseases and cancer

PLoS Computational Biology ◽

10.1371/journal.pcbi.1007503 ◽

2020 ◽

Vol 16 (2) ◽

pp. e1007503 ◽

Cited By ~ 1

Author(s):

Paco Derouault ◽

Jasmine Chauzeix ◽

David Rizzo ◽

Federica Miressi ◽

Corinne Magdelaine ◽

...

Keyword(s):

Copy Number Variation ◽

Copy Number ◽

Amplicon Sequencing ◽

Sequencing Data ◽

Efficient Tool ◽

Inherited Diseases ◽

Number Variation

Rapid, Paralog-Sensitive CNV Analysis of 2457 Human Genomes Using QuicK-mer2

Genes ◽

10.3390/genes11020141 ◽

2020 ◽

Vol 11 (2) ◽

pp. 141 ◽

Cited By ~ 5

Author(s):

Feichen Shen ◽

Jeffrey M. Kidd

Keyword(s):

Copy Number Variation ◽

Copy Number ◽

Sequence Data ◽

Data Sets ◽

Short Read ◽

Major Mechanism ◽

Rapid Construction ◽

A Genome ◽

Number Variation ◽

Short Read Sequence

Gene duplication is a major mechanism for the evolution of gene novelty, and copy-number variation makes a major contribution to inter-individual genetic diversity. However, most approaches for studying copy-number variation rely upon uniquely mapping reads to a genome reference and are unable to distinguish among duplicated sequences. Specialized approaches to interrogate specific paralogs are comparatively slow and have a high degree of computational complexity, limiting their effective application to emerging population-scale data sets. We present QuicK-mer2, a self-contained, mapping-free approach that enables the rapid construction of paralog-specific copy-number maps from short-read sequence data. This approach is based on the tabulation of unique k-mer sequences from short-read data sets, and is able to analyze a 20X coverage human genome in approximately 20 min. We applied our approach to newly released sequence data from the 1000 Genomes Project, constructed paralog-specific copy-number maps from 2457 unrelated individuals, and uncovered copy-number variation of paralogous genes. We identify nine genes where none of the analyzed samples have a copy number of two, 92 genes where the majority of samples have a copy number other than two, and describe rare copy number variation affecting multiple genes at the APOBEC3 locus.
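
The mapping-free idea in this abstract, counting k-mers that are unique in the reference and reading copy number off their depth in the reads, can be sketched in a few lines. This is a toy illustration and not QuicK-mer2's implementation: it ignores reverse complements, sequencing errors, and GC correction, and it normalizes against a user-supplied control region assumed to be present at the given ploidy. All function names and parameters are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, a simplification)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def unique_reference_kmers(reference, k):
    """Return the k-mers that occur exactly once across the reference.

    `reference` maps sequence names to sequences.
    """
    counts = Counter(km for seq in reference.values() for km in kmers(seq, k))
    return {km for km, n in counts.items() if n == 1}


def estimate_copy_number(reads, region_seq, control_seq, unique, k, ploidy=2):
    """Estimate a region's copy number from unique k-mer depth in reads.

    Depth is the median read count of the region's unique k-mers, scaled by
    the depth of a control region assumed to have copy number `ploidy`.
    """
    read_counts = Counter(
        km for read in reads for km in kmers(read, k) if km in unique
    )

    def depth(seq):
        values = [read_counts[km] for km in kmers(seq, k) if km in unique]
        return median(values) if values else 0.0

    control_depth = depth(control_seq)
    if control_depth == 0:
        raise ValueError("control region has no unique k-mer coverage")
    return ploidy * depth(region_seq) / control_depth


# Toy input: the gene is covered twice as deeply as the control region.
reference = {"control": "ACGTTGCA", "gene": "GGATCCTA"}
unique = unique_reference_kmers(reference, k=4)
reads = ["ACGTTGCA"] * 2 + ["GGATCCTA"] * 4
copy_number = estimate_copy_number(
    reads, reference["gene"], reference["control"], unique, k=4
)
```

Because only k-mers unique to one paralog contribute to that paralog's depth, an approach of this kind can separate duplicated sequences that read mapping would conflate, which is the property the abstract highlights.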
