AluMine: alignment-free method for the discovery of polymorphic Alu element insertions

10.1101/588434 ◽

2019 ◽

Author(s):

Tarmo Puurand ◽

Viktoria Kukuškina ◽

Fanny-Dhelia Pajuste ◽

Maido Remm

Keyword(s):

Reference Genome ◽

Genome Project ◽

Personal Genomics ◽

Robust Analysis ◽

Alu Elements ◽

Alignment Free ◽

Alu Element ◽

Hardware Configuration ◽

Personal Genomes ◽

Discovery Pipeline

Background: Recently, alignment-free sequence analysis methods have gained popularity in the field of personal genomics. These methods are based on counting frequencies of short k-mer sequences, thus allowing faster and more robust analysis compared to traditional alignment-based methods.

Results: We have created a fast alignment-free method, AluMine, to analyze polymorphic insertions of Alu elements in the human genome. We tested the method on 2,241 individuals from the Estonian Genome Project and identified 28,962 potential polymorphic Alu element insertions. Each tested individual had on average 1,574 Alu element insertions that were different from those in the reference genome. In addition, we propose an alignment-free genotyping method that uses the frequency of insertion/deletion-specific 32-mer pairs to call the genotype directly from raw sequencing reads. Using this method, the concordance between the predicted and experimentally observed genotypes was 98.7%. The running time of the discovery pipeline is approximately 2 hours per individual. The genotyping of potential polymorphic insertions takes between 0.4 and 4 hours per individual, depending on the hardware configuration.

Conclusions: AluMine provides tools that allow discovery of novel Alu element insertions and/or genotyping of known Alu element insertions from personal genomes within a few hours.
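The genotyping idea described in this abstract (comparing the frequency of an insertion-specific k-mer with that of its reference-specific partner) can be illustrated with a small sketch. This is not the AluMine implementation: the function names, the use of canonical k-mers, the thresholds `het_low`/`het_high`, and the minimum count `min_total` are all assumptions chosen for illustration, and the demo uses a short k instead of 32 for readability.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads (strand-independent counting)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def call_genotype(counts, ins_kmer, ref_kmer,
                  min_total=2, het_low=0.2, het_high=0.8):
    """Call a genotype from the counts of one k-mer pair.

    ins_kmer supports the insertion allele, ref_kmer the allele without it.
    The thresholds are illustrative assumptions, not values from the paper.
    """
    ins = counts[canonical(ins_kmer)]
    ref = counts[canonical(ref_kmer)]
    total = ins + ref
    if total < min_total:
        return "NC"  # no call: too few supporting k-mers
    frac = ins / total
    if frac < het_low:
        return "0/0"
    if frac > het_high:
        return "1/1"
    return "0/1"


if __name__ == "__main__":
    reads = ["AAACCC", "AAACCC", "AAAGGG"]
    counts = count_kmers(reads, k=6)
    print(call_genotype(counts, "AAACCC", "AAAGGG"))
```

A real pipeline would count 32-mers from the full read set with a dedicated k-mer counter; the fraction-based call here only shows how a pair of allele-specific k-mers can yield a genotype without alignment.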

Korean Genome Project: 1094 Korean personal genomes with clinical information

Science Advances ◽

10.1126/sciadv.aaz7835 ◽

2020 ◽

Vol 6 (22) ◽

pp. eaaz7835 ◽

Cited By ~ 2

Author(s):

Sungwon Jeon ◽

Youngjune Bhak ◽

Yeonsong Choi ◽

Yeonsu Jeon ◽

Seunghoon Kim ◽

...

Keyword(s):

Genome Wide Association Study ◽

Imputation Accuracy ◽

Clinical Information ◽

Genome Project ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Genome Wide ◽

A Genome ◽

Whole Genomes ◽

Personal Genomes

We present the initial phase of the Korean Genome Project (Korea1K), including 1094 whole genomes (sequenced at an average depth of 31×), along with data of 79 quantitative clinical traits. We identified 39 million single-nucleotide variants and indels of which half were singleton or doubleton and detected Korean-specific patterns based on several types of genomic variations. A genome-wide association study illustrated the power of whole-genome sequences for analyzing clinical traits, identifying nine more significant candidate alleles than previously reported from the same linkage disequilibrium blocks. Also, Korea1K, as a reference, showed better imputation accuracy for Koreans than the 1KGP panel. As proof of utility, germline variants in cancer samples could be filtered out more effectively when the Korea1K variome was used as a panel of normals compared to non-Korean variome sets. Overall, this study shows that Korea1K can be a useful genotypic and phenotypic resource for clinical and ethnogenetic studies.

An ethnically relevant consensus Korean reference genome is a step towards personal reference genomes

Nature Communications ◽

10.1038/ncomms13637 ◽

2016 ◽

Vol 7 (1) ◽

Cited By ~ 26

Author(s):

Yun Sung Cho ◽

Hyunho Kim ◽

Hak-Min Kim ◽

Sungwoong Jho ◽

JeHoon Jun ◽

...

Keyword(s):

Large Scale ◽

New Technologies ◽

Reference Genome ◽

Genomic Structure ◽

Genome Project ◽

Personal Genome ◽

Personal Genomic ◽

Personal Reference ◽

Scale Population ◽

Genome Assemblies

Human genomes are routinely compared against a universal reference. However, this strategy could miss population-specific and personal genomic variations, which may be detected more efficiently using an ethnically relevant or personal reference. Here we report a hybrid assembly of a Korean reference genome (KOREF) for constructing personal and ethnic references by combining sequencing and mapping methods. We also build its consensus variome reference, providing information on millions of variants from 40 additional ethnically homogeneous genomes from the Korean Personal Genome Project. We find that the ethnically relevant consensus reference can be beneficial for efficient variant detection. Systematic comparison of human assemblies shows the importance of assembly quality, suggesting the necessity of new technologies to comprehensively map ethnic and personal genomic structure variations. In the era of large-scale population genome projects, the leveraging of ethnicity-specific genome assemblies as well as the human reference genome will accelerate mapping all human genome diversity.

Human chromosome 5 sequence primer amplifies Alu polymorphisms on chromosomes 2 and 17

Genome ◽

10.1139/g93-042 ◽

1993 ◽

Vol 36 (2) ◽

pp. 302-309 ◽

Cited By ~ 2

Author(s):

Lynn E. Bernard ◽

Stephen Wood

Keyword(s):

Human Chromosome ◽

Chromosome 5 ◽

Alu Elements ◽

Chain Reactions ◽

Polymerase Chain Reactions ◽

Human Dna ◽

Alu Element ◽

Polymerase Chain ◽

Nonspecific Amplification ◽

Human Chromosome 5

Members of the Alu family of repetitive elements occur frequently in the human genome and are often polymorphic. Techniques involving Alu element mediated polymerase chain reactions (Alu PCR) allow the isolation of region-specific human DNA fragments from mixed DNA sources. Such fragments are a source of region-specific Alu elements useful for the detection of Alu-related polymorphisms. A clone from human chromosome 5, corresponding to locus D5F40S1, was isolated using Alu PCR differential hybridization. Alu elements within this clone were investigated for the presence of potentially polymorphic 3′ polyA tails. Primers were devised to amplify the 3′ polyA tail of an Alu element present within the clone. One primer, D5F40S1-T, was specific to the DNA flanking the 3′ end of the Alu element, and the other primer was homologous to sequences within the element. When these primers were used in PCR reactions, products from chromosomes 2 and 17 (loci D2F40S2 and D17F40S3) were amplified in addition to the expected product from chromosome 5. The most likely explanation for this nonspecific amplification is that the D5F40S1-T primer is located within a low-copy repetitive element that is 3′ of the Alu element. This phenomenon presents a potential problem for the identification of region-specific Alu polymorphisms.

Key words: Alu polymorphism, human chromosome 5, polymerase chain reaction, D5F40S1, D2F40S2, D17F40S3.

S-conLSH: alignment-free gapped mapping of noisy long reads

BMC Bioinformatics ◽

10.1186/s12859-020-03918-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Angana Chakraborty ◽

Burkhard Morgenstern ◽

Sanghamitra Bandyopadhyay

Keyword(s):

Reference Genome ◽

Genome Mapping ◽

Sequence Data ◽

Downstream Processing ◽

Read Length ◽

Alignment Free ◽

Spaced Seeds ◽

Long Reads ◽

Gc Bias ◽

High Level

Background: The advancement of SMRT technology has unfolded new opportunities of genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMRT sequencing. However, the state-of-the-art aligners often fail to identify distant homologies due to lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate.

Results: We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. S-conLSH is at least 2 times faster than the recently developed method lordFAST. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on human simulated sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option of generating aligned output as SAM-file, if it is required for any downstream processing.

Conclusions: S-conLSH is one of the first alignment-free reference genome mapping tools achieving a high level of sensitivity. The spaced-context is especially suitable for extracting distant similarities. The variable-length spaced-seeds or patterns add flexibility to the proposed algorithm by introducing gapped mapping of the noisy long reads. Therefore, S-conLSH may be considered as a prominent direction towards alignment-free sequence analysis.
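To make the notion of mapping with multiple spaced patterns more concrete, the following is a minimal sketch of generic spaced-seed indexing with diagonal voting. It is not the S-conLSH algorithm or its hashing scheme: the pattern strings, function names, and the voting rule are assumptions for illustration. A pattern is a string of `1` (position used in the key) and `0` (position ignored), so a read window can still match the reference when an error falls on an ignored position.

```python
from collections import Counter, defaultdict


def spaced_key(seq, start, pattern):
    """Build a key from the characters of seq under the '1' positions of pattern."""
    return "".join(seq[start + i] for i, c in enumerate(pattern) if c == "1")


def build_index(reference, patterns):
    """Map (pattern id, spaced key) to every reference position where it occurs."""
    index = defaultdict(list)
    for p_id, pattern in enumerate(patterns):
        for pos in range(len(reference) - len(pattern) + 1):
            index[(p_id, spaced_key(reference, pos, pattern))].append(pos)
    return index


def map_read(read, index, patterns):
    """Return the reference start with the most supporting windows, or None."""
    votes = Counter()
    for p_id, pattern in enumerate(patterns):
        for off in range(len(read) - len(pattern) + 1):
            key = (p_id, spaced_key(read, off, pattern))
            for pos in index.get(key, ()):
                # A hit at reference position pos from read offset off
                # implies the read starts at pos - off.
                votes[pos - off] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    reference = "TTGACCGTAGGCTAACGTTAGCCATGCAAT"
    patterns = ["1101011", "111011"]  # hypothetical spaced patterns
    index = build_index(reference, patterns)
    # reference[8:20] with one substitution (A -> C) at read position 5
    noisy_read = "AGGCTCACGTTA"
    print(map_read(noisy_read, index, patterns))
```

The read above differs from the reference at one base, yet several windows still vote for the correct start because the substitution lands on a `0` position of some pattern placements. No base-to-base alignment is computed at any point.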

AluMine: alignment-free method for the discovery of polymorphic Alu element insertions

Mobile DNA ◽

10.1186/s13100-019-0174-3 ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 4

Author(s):

Tarmo Puurand ◽

Viktoria Kukuškina ◽

Fanny-Dhelia Pajuste ◽

Maido Remm

Keyword(s):

Alignment Free ◽

Alu Element

A faster implementation of association mapping from k-mers

10.1101/2020.04.14.040675 ◽

2020 ◽

Author(s):

Zakaria Mehrab ◽

Jaiaid Mobin ◽

Ibrahim Asadullah Tahmid ◽

Atif Rahman

Keyword(s):

Association Mapping ◽

Reference Genome ◽

Association Studies ◽

Free Association ◽

Mapping Method ◽

Genome Wide Association Studies ◽

E Coli ◽

Alignment Free ◽

Genome Wide ◽

Specific Sequences

Genome-wide association studies (GWAS) attempt to map genotypes to phenotypes in organisms. This is typically performed by genotyping individuals using microarray or by aligning whole genome sequencing reads to a reference genome. Both approaches require knowledge of a reference genome, which limits their application to organisms with no or incomplete reference genomes. This caveat can be removed using alignment-free association mapping methods based on k-mers from sequencing reads. Here we present an implementation of an alignment-free association mapping method [1] to improve its execution time and flexibility. We have tested our implementation on an E. coli ampicillin resistance dataset and observe improvement in performance over the original implementation while maintaining accuracy in results. Finally, we demonstrate that the method can be applied to find sex-specific sequences.

Personal genomes in progress: from the human genome project to the personal genome project

Dialogues in Clinical Neuroscience ◽

10.31887/dcns.2010.12.1/jlunshof ◽

2010 ◽

Vol 12 (1) ◽

pp. 47-60 ◽

Cited By ~ 3

Keyword(s):

Human Genome ◽

Significant Rise ◽

Genome Project ◽

Personal Genome ◽

Research Subjects ◽

Single Nucleotide ◽

Health Records ◽

Personal Genomes ◽

The Cost ◽

The Human Genome Project

The cost of a diploid human genome sequence has dropped from about $70M to $2000 since 2007, even as the standards for redundancy have increased from 7x to 40x in order to improve call rates. Coupled with the low return on investment for common single-nucleotide polymorphisms, this has caused a significant rise in interest in correlating genome sequences with comprehensive environmental and trait data (GET). The costs of electronic health records, imaging, and microbial, immunological, and behavioral data are also dropping quickly. Sharing such integrated GET datasets and their interpretations with a diversity of researchers and research subjects highlights the need for informed-consent models capable of addressing novel privacy and other issues, as well as for flexible data-sharing resources that make materials and data available with minimum restrictions on use. This article examines the Personal Genome Project's effort to develop a GET database as a public genomics resource broadly accessible to both researchers and research participants, while pursuing the highest standards in research ethics.

S-conLSH: Alignment-free gapped mapping of noisy long reads

10.1101/801118 ◽

2019 ◽

Author(s):

Angana Chakraborty ◽

Burkhard Morgenstern ◽

Sanghamitra Bandyopadhyay

Keyword(s):

Reference Genome ◽

State Of The Art ◽

Sequence Data ◽

Downstream Processing ◽

The State ◽

Read Length ◽

Alignment Free ◽

Long Reads ◽

Gc Bias ◽

Target Locations

Motivation: The advancement of SMRT technology has unfolded new opportunities of genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMRT sequencing. However, the state-of-the-art aligners often fail to identify distant homologies due to lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate.

Results: We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. S-conLSH is at least 2 times faster than the state-of-the-art alignment-based methods. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on human simulated sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option of generating aligned output as SAM-file, if it is required for any downstream processing.

Availability: The source code of our software is freely available at https://github.com/anganachakraborty/S-conLSH

KAGE: Fast alignment-free graph-based genotyping of SNPs and short indels

10.1101/2021.12.03.471074 ◽

2021 ◽

Author(s):

Ivar Grytten ◽

Knut D. Rand ◽

Geir K. Sandve

Keyword(s):

Genetic Variation ◽

High Throughput Sequencing ◽

Reference Genome ◽

Computational Cost ◽

Computationally Efficient ◽

Free Graph ◽

Alignment Free ◽

Recent Developments ◽

Reference Bias ◽

Short Indels

One of the core applications of high-throughput sequencing is the characterization of individual genetic variation. Traditionally, variants have been inferred by comparing sequenced reads to a reference genome. There has recently been an emergence of genotyping methods, which instead infer variants of an individual based on variation present in population-scale repositories like the 1000 Genomes Project. However, commonly used methods for genotyping are slow since they still require mapping of reads to a reference genome. Also, since traditional reference genomes do not include genetic variation, traditional genotypers suffer from reference bias and poor accuracy in variation-rich regions where reads cannot accurately be mapped.

We here present KAGE, a genotyper for SNPs and short indels that is inspired by recent developments within graph-based genome representations and alignment-free genotyping. We propose two novel ideas to improve both the speed and accuracy: we (1) use known genotypes from thousands of individuals in a Bayesian model to predict genotypes, and (2) propose a computationally efficient method for leveraging correlation between variants.

We show on experimental data that KAGE is both faster and more accurate than other alignment-free genotypers. KAGE is able to genotype a new sample (15x coverage) in less than half an hour on a consumer laptop, more than 10 times faster than the fastest existing methods, making it ideal in clinical settings or when large numbers of individuals are to be genotyped at low computational cost.
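The first idea in this abstract, using known genotypes from a population as prior information in a Bayesian model, can be sketched as follows. This is not KAGE's model: the binomial likelihood over reference and alternative k-mer counts, the `error` rate, the add-one pseudocount, and all names are assumptions made for illustration.

```python
from math import comb


def genotype_posterior(ref_count, alt_count, population_counts, error=0.01):
    """Posterior over genotypes from k-mer counts and a population prior.

    ref_count / alt_count: k-mers in the sample supporting each allele.
    population_counts: how many known individuals carry each genotype,
    e.g. {"0/0": 990, "0/1": 9, "1/1": 1}; missing genotypes count as 0.
    """
    # Assumed probability that a sampled k-mer supports the alt allele.
    alt_prob = {"0/0": error, "0/1": 0.5, "1/1": 1.0 - error}
    n = ref_count + alt_count
    pop_total = sum(population_counts.get(g, 0) for g in alt_prob) + len(alt_prob)

    unnormalised = {}
    for genotype, p in alt_prob.items():
        # Add-one pseudocount keeps unseen genotypes possible.
        prior = (population_counts.get(genotype, 0) + 1) / pop_total
        likelihood = comb(n, alt_count) * p ** alt_count * (1 - p) ** ref_count
        unnormalised[genotype] = prior * likelihood

    norm = sum(unnormalised.values())
    return {g: v / norm for g, v in unnormalised.items()}


if __name__ == "__main__":
    flat = genotype_posterior(2, 1, {})
    informed = genotype_posterior(2, 1, {"0/0": 990, "0/1": 9, "1/1": 1})
    print(max(flat, key=flat.get), max(informed, key=informed.get))
```

With only three k-mers observed, a flat prior favours the heterozygous call, while a population in which the variant is rare pulls the call towards homozygous reference. This illustrates why population genotypes can help at low coverage.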

Efficient association mapping from k-mers—An application in finding sex-specific sequences

PLoS ONE ◽

10.1371/journal.pone.0245058 ◽

2021 ◽

Vol 16 (1) ◽

pp. e0245058

Author(s):

Zakaria Mehrab ◽

Jaiaid Mobin ◽

Ibrahim Asadullah Tahmid ◽

Atif Rahman

Keyword(s):

Association Mapping ◽

Reference Genome ◽

Association Studies ◽

Free Association ◽

Mapping Method ◽

Genome Wide Association Studies ◽

E Coli ◽

Alignment Free ◽

Genome Wide ◽

Specific Sequences

Genome-wide association studies (GWAS) attempt to map genotypes to phenotypes in organisms. This is typically performed by genotyping individuals using microarray or by aligning whole genome sequencing reads to a reference genome. Both approaches require knowledge of a reference genome, which hinders their application to organisms with no or incomplete reference genomes. This caveat can be removed by using alignment-free association mapping methods based on k-mers from sequencing reads. Here we present an improved implementation of an alignment-free association mapping method. The new implementation is faster and includes additional features to make it more flexible than the original implementation. We have tested our implementation on an E. coli ampicillin resistance dataset and observe improvement in execution time over the original implementation while maintaining accuracy in results. We also demonstrate that the method can be applied to find sex-specific sequences.
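As a rough illustration of reference-free association mapping from k-mers, the sketch below tests whether the presence of each k-mer differs between case and control samples using a one-sided Fisher's exact test. This is a simplified stand-in, not the statistical test or the implementation described in the abstract; the function names and the presence/absence encoding are assumptions.

```python
from math import comb


def sample_kmers(reads, k):
    """Return the set of k-mers present in one sample's reads."""
    found = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            found.add(read[i:i + k])
    return found


def fisher_one_sided(case_hits, n_cases, control_hits, n_controls):
    """P(at least case_hits carriers are cases) under the hypergeometric null."""
    carriers = case_hits + control_hits
    total = n_cases + n_controls
    upper = min(carriers, n_cases)
    tail = sum(
        comb(n_cases, x) * comb(n_controls, carriers - x)
        for x in range(case_hits, upper + 1)
    )
    return tail / comb(total, carriers)


def associate(cases, controls, k):
    """Map each observed k-mer to its p-value for enrichment in cases."""
    case_sets = [sample_kmers(reads, k) for reads in cases]
    control_sets = [sample_kmers(reads, k) for reads in controls]
    all_kmers = set().union(*case_sets, *control_sets)
    p_values = {}
    for kmer in all_kmers:
        a = sum(kmer in s for s in case_sets)
        b = sum(kmer in s for s in control_sets)
        p_values[kmer] = fisher_one_sided(a, len(case_sets), b, len(control_sets))
    return p_values


if __name__ == "__main__":
    cases = [["GATTACA", "CCCCCCC"]] * 3  # toy samples, e.g. one sex
    controls = [["CCCCCCC"]] * 3          # toy samples, e.g. the other sex
    for kmer, p in sorted(associate(cases, controls, k=7).items()):
        print(kmer, round(p, 3))
```

In this toy data the k-mer found only in cases gets the smallest possible p-value for three cases and three controls, while the shared k-mer is uninformative. A real analysis would use k-mer counts, many more samples, and correction for multiple testing and population structure.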