Rapid, Paralog-Sensitive CNV Analysis of 2457 Human Genomes Using QuicK-mer2

Gene duplication is a major mechanism for the evolution of gene novelty, and copy-number variation makes a major contribution to inter-individual genetic diversity. However, most approaches for studying copy-number variation rely upon uniquely mapping reads to a genome reference and are unable to distinguish among duplicated sequences. Specialized approaches to interrogate specific paralogs are comparatively slow and have a high degree of computational complexity, limiting their effective application to emerging population-scale data sets. We present QuicK-mer2, a self-contained, mapping-free approach that enables the rapid construction of paralog-specific copy-number maps from short-read sequence data. This approach is based on the tabulation of unique k-mer sequences from short-read data sets, and is able to analyze a 20X coverage human genome in approximately 20 min. We applied our approach to newly released sequence data from the 1000 Genomes Project, constructed paralog-specific copy-number maps from 2457 unrelated individuals, and uncovered copy-number variation of paralogous genes. We identify nine genes where none of the analyzed samples have a copy number of two and 92 genes where the majority of samples have a copy number other than two, and we describe rare copy-number variation affecting multiple genes at the APOBEC3 locus.
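The idea of tabulating unique k-mers to estimate copy number can be illustrated with a minimal Python sketch. This is not the QuicK-mer2 implementation: the function names, the toy sequences, and the `control_depth` parameter (the mean k-mer depth expected for a region with two copies) are assumptions made for illustration, and the sketch ignores reverse complements, sequencing errors, and GC correction.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def unique_reference_kmers(reference, k):
    """Return the k-mers that occur exactly once in the reference."""
    counts = Counter(kmers(reference, k))
    return {kmer for kmer, n in counts.items() if n == 1}


def estimate_copy_number(reads, unique_kmers, k, control_depth):
    """Scale the mean depth over unique k-mers so control_depth maps to 2 copies.

    control_depth is a hypothetical input: the mean k-mer depth observed
    in regions assumed to be present in two copies.
    """
    if not unique_kmers:
        raise ValueError("region has no unique k-mers")
    depth = Counter(
        km for read in reads for km in kmers(read, k) if km in unique_kmers
    )
    mean_depth = sum(depth[km] for km in unique_kmers) / len(unique_kmers)
    return 2.0 * mean_depth / control_depth


# Toy example: every 4-mer of this reference is unique.
reference = "ACGTTGCA"
unique = unique_reference_kmers(reference, 4)
reads = [reference] * 3  # each unique k-mer is seen three times
print(estimate_copy_number(reads, unique, 4, control_depth=3.0))
```

Because only k-mers that occur once in the reference are counted, depth at those k-mers reflects a single paralog rather than the sum over all duplicated copies.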


Copy-number variation in sporadic amyotrophic lateral sclerosis: a genome-wide screen

The Lancet Neurology ◽

10.1016/s1474-4422(08)70048-6 ◽

2008 ◽

Vol 7 (4) ◽

pp. 319-326 ◽

Cited By ~ 70

Author(s):

Hylke M Blauw ◽

Jan H Veldink ◽

Michael A van Es ◽

Paul W van Vught ◽

Christiaan GJ Saris ◽

...

Keyword(s):

Amyotrophic Lateral Sclerosis ◽

Copy Number Variation ◽

Copy Number ◽

Sporadic Amyotrophic Lateral Sclerosis ◽

Genome Wide ◽

A Genome ◽

Number Variation ◽

Lateral Sclerosis


A genome-wide survey of copy number variation regions in various chicken breeds by array comparative genomic hybridization method

Animal Genetics ◽

10.1111/j.1365-2052.2011.02308.x ◽

2012 ◽

Vol 43 (3) ◽

pp. 282-289 ◽

Cited By ~ 26

Author(s):

Y. Wang ◽

X. Gu ◽

C. Feng ◽

C. Song ◽

X. Hu ◽

...

Keyword(s):

Copy Number Variation ◽

Copy Number ◽

Array Comparative Genomic Hybridization ◽

Comparative Genomic ◽

Hybridization Method ◽

Genome Wide ◽

A Genome ◽

Number Variation ◽

Chicken Breeds ◽

Genome Wide Survey


A genome-wide analysis of copy number variation in Murciano-Granadina goats

Genetics Selection Evolution ◽

10.1186/s12711-020-00564-4 ◽

2020 ◽

Vol 52 (1) ◽

Author(s):

Dailu Guan ◽

Amparo Martínez ◽

Anna Castelló ◽

Vincenzo Landi ◽

María Gracia Luigi-Sierra ◽

...

Keyword(s):

Copy Number Variation ◽

Copy Number ◽

Genome Wide Analysis ◽

Genome Wide ◽

A Genome ◽

Number Variation


Mutational sequencing for accurate count and long-range assembly

10.1101/149740 ◽

2017 ◽

Author(s):

Vijay Kumar ◽

Julie Rosenbaum ◽

Zihua Wang ◽

Talitha Forcier ◽

Michael Ronemus ◽

...

Keyword(s):

Long Range ◽

Copy Number ◽

Sequence Data ◽

Template Molecule ◽

Short Read ◽

Unique Pattern ◽

Short Read Sequence

We introduce a new protocol, mutational sequencing or muSeq, which randomly deaminates unmethylated cytosines at a fixed and tunable rate. The muSeq protocol marks each initial template molecule with a unique mutation signature that is present in every copy of the template, and in every fragmented copy of a copy. In the sequenced read data, this signature is observed as a unique pattern of C-to-T or G-to-A nucleotide conversions. Clustering reads with the same conversion pattern enables accurate count and long-range assembly of initial template molecules from short-read sequence data. We explore count and low-error sequencing by profiling a 135,000 fragment PstI representation, demonstrating that muSeq improves copy number inference and significantly reduces sporadic sequencer error. We explore long-range assembly in the context of cDNA, generating contiguous transcript clusters greater than 3,000 bp in length. The muSeq assemblies reveal transcriptional diversity not observable from short-read data alone.
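The clustering step described above can be sketched in Python under strong simplifying assumptions: every read spans the whole reference fragment, there is no sequencing error, and only C-to-T conversions on one strand are considered. The function names and toy sequences are hypothetical and are not taken from the muSeq software.

```python
from collections import defaultdict


def conversion_signature(reference, read):
    """Return the positions where a reference C is read as T."""
    return tuple(
        i for i, (ref_base, read_base) in enumerate(zip(reference, read))
        if ref_base == "C" and read_base == "T"
    )


def cluster_by_signature(reference, reads):
    """Group reads that share the same C-to-T conversion pattern."""
    clusters = defaultdict(list)
    for read in reads:
        clusters[conversion_signature(reference, read)].append(read)
    return dict(clusters)


# Toy example: two reads share one signature, a third read has another.
reference = "ACCGTCCA"
reads = ["ATCGTCCA", "ATCGTCCA", "ACCGTTCA"]
clusters = cluster_by_signature(reference, reads)

# Under these assumptions, the number of distinct signatures estimates
# the number of initial template molecules.
print(len(clusters))
```

In this simplified view, counting clusters rather than reads is what makes the count robust to uneven amplification of individual templates.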


MOST: a modified MLST typing tool based on short read sequencing

PeerJ ◽

10.7717/peerj.2308 ◽

2016 ◽

Vol 4 ◽

pp. e2308 ◽

Cited By ~ 63

Author(s):

Rediat Tewolde ◽

Timothy Dallman ◽

Ulf Schaefer ◽

Carmen L. Sheppard ◽

Philip Ashton ◽

...

Keyword(s):

Conventional Method ◽

Sequence Data ◽

Pcr Amplification ◽

Housekeeping Genes ◽

Data Sets ◽

Bacterial Genomes ◽

Bacterial Populations ◽

Short Read ◽

Short Read Sequence ◽

Low Coverage

Multilocus sequence typing (MLST) is an effective method to describe bacterial populations. Conventionally, MLST involves Polymerase Chain Reaction (PCR) amplification of housekeeping genes followed by Sanger DNA sequencing. Public Health England (PHE) is in the process of replacing the conventional MLST methodology with a method based on short-read sequence data derived from Whole Genome Sequencing (WGS). This paper compares the reliability of MLST results derived from WGS data by mapping- and assembly-based approaches with results from conventional methods, using 323 bacterial genomes of diverse species. The sensitivity of the two WGS-based methods was further investigated with 26 mixed and 29 low-coverage genomic data sets from Salmonella enteridis and Streptococcus pneumoniae. Of the 323 samples, 92.9% (n = 300), 97.5% (n = 315) and 99.7% (n = 322) full MLST profiles were derived by the conventional method, assembly-based and mapping-based approaches, respectively. The concordance between samples that were typed by conventional (92.9%) and both WGS methods was 100%. From the 55 mixed and low-coverage genomes, 89.1% (n = 49) and 67.3% (n = 37) full MLST profiles were derived from the mapping- and assembly-based approaches, respectively. In conclusion, deriving MLST from WGS data is more sensitive than the conventional method. When comparing WGS-based methods, the mapping-based approach was the most sensitive. In addition, the mapping-based approach described here derives quality metrics, which are difficult to determine quantitatively using conventional and WGS assembly-based approaches.
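The percentages reported above follow directly from the stated counts. A short Python check, using only the numbers given in the abstract (the dictionary layout and the helper name `percent` are illustrative choices):

```python
def percent(count, total):
    """Return count / total as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Full MLST profiles out of the 323 samples, per method.
full_set = {"conventional": 300, "assembly": 315, "mapping": 322}
# Full MLST profiles out of the 55 mixed and low-coverage genomes.
hard_set = {"mapping": 49, "assembly": 37}

full_pct = {method: percent(n, 323) for method, n in full_set.items()}
hard_pct = {method: percent(n, 55) for method, n in hard_set.items()}

print(full_pct)
print(hard_pct)
```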
