An improved encoding of genetic variation in a Burrows-Wheeler transform

2019 ◽  
Author(s):  
Thomas Büchler ◽  
Enno Ohlebusch

Abstract Motivation In resequencing experiments, a high-throughput sequencer produces DNA-fragments (called reads) and each read is then mapped to the locus in a reference genome at which it fits best. Currently dominant read mappers (Li and Durbin, 2009; Langmead and Salzberg, 2012) are based on the Burrows-Wheeler transform (BWT). A read can be mapped correctly if it is similar enough to a substring of the reference genome. However, since the reference genome does not represent all known variations, read mapping tends to be biased towards the reference and mapping errors may thus occur. To cope with this problem, Huang et al. (2013) encoded SNPs in a BWT by the IUPAC nucleotide code (Cornish-Bowden, 1985). In a different approach, Maciuca et al. (2016) provided a ‘natural encoding’ of SNPs and other genetic variations in a BWT. However, their encoding resulted in a significantly increased alphabet size (the modified alphabet can have millions of new symbols, which usually implies a loss of efficiency). Moreover, the two approaches do not handle all known kinds of variation. Results In this article, we propose a method that is able to encode many kinds of genetic variation (SNPs, MNPs, indels, duplications, transpositions, inversions, and copy-number variation) in a BWT. It takes the best of both worlds: SNPs are encoded by the IUPAC nucleotide code as in (Huang et al., 2013) and the encoding of the other kinds of genetic variation relies on the idea introduced in (Maciuca et al., 2016). In contrast to Maciuca et al. (2016), however, we use only one additional symbol. This symbol marks variant sites in a chromosome and delimits multiple variants, which are added at the end of the ‘marked chromosome’. We show how the backward search algorithm, which is used in BWT-based read mappers, can be modified in such a way that it can cope with the genetic variation encoded in the BWT. We implemented our method and compared it to BWBBLE (Huang et al., 2013) and gramtools (Maciuca et al., 2016). Availability https://www.uni-ulm.de/in/theo/research/seqana/ Contact: [email protected]
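For orientation, the plain backward search that BWT-based read mappers build on can be sketched as follows. This is a minimal illustration of ordinary exact matching only, not the variation-aware modification the abstract describes. The function names `bwt_from_text` and `backward_search` are hypothetical, and the naive suffix sorting and linear-time `occ` scan are simplifications that a real index would replace with sampled rank structures.

```python
from collections import Counter


def bwt_from_text(text):
    """Return (bwt, suffix_array) of text + '$' using naive suffix sorting."""
    s = text + "$"
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character of a suffix is the character preceding it (cyclically).
    return "".join(s[i - 1] for i in sa), sa


def backward_search(bwt, pattern):
    """Return the half-open suffix-array interval [lo, hi) matching pattern."""
    counts = Counter(bwt)
    # C[c] = number of characters in the text that are strictly smaller than c.
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    def occ(c, i):
        # Occurrences of c in bwt[0:i]; a linear scan stands in for a rank query.
        return bwt[:i].count(c)

    lo, hi = 0, len(bwt)
    for c in reversed(pattern):  # process the pattern right to left
        if c not in C:
            return (0, 0)
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return (0, 0)
    return (lo, hi)


bwt, sa = bwt_from_text("ACGTACGA")
lo, hi = backward_search(bwt, "ACG")
print(sorted(sa[lo:hi]))  # text positions where "ACG" occurs
```

Each step narrows the interval by one pattern character, which is why the cost of the rank query matters so much when the alphabet grows.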


2019 ◽  
Author(s):  
Thomas Büchler ◽  
Enno Ohlebusch

Abstract Motivation In resequencing experiments, a high-throughput sequencer produces DNA-fragments (called reads) and each read is then mapped to the locus in a reference genome at which it fits best. Currently dominant read mappers are based on the Burrows–Wheeler transform (BWT). A read can be mapped correctly if it is similar enough to a substring of the reference genome. However, since the reference genome does not represent all known variations, read mapping tends to be biased towards the reference and mapping errors may thus occur. To cope with this problem, Huang et al. encoded single nucleotide polymorphisms (SNPs) in a BWT by the International Union of Pure and Applied Chemistry (IUPAC) nucleotide code. In a different approach, Maciuca et al. provided a ‘natural encoding’ of SNPs and other genetic variations in a BWT. However, their encoding resulted in a significantly increased alphabet size (the modified alphabet can have millions of new symbols, which usually implies a loss of efficiency). Moreover, the two approaches do not handle all known kinds of variation. Results In this article, we propose a method that is able to encode many kinds of genetic variation (SNPs, multi-nucleotide polymorphisms, insertions or deletions, duplications, transpositions, inversions and copy-number variation) in a BWT. It takes the best of both worlds: SNPs are encoded by the IUPAC nucleotide code as in Huang et al. (2013, Short read alignment with populations of genomes. Bioinformatics, 29, i361–i370) and the encoding of the other kinds of genetic variation relies on the idea introduced in Maciuca et al. (2016, A natural encoding of genetic variation in a Burrows-Wheeler transform to enable mapping and genome inference. In: Proceedings of the 16th International Workshop on Algorithms in Bioinformatics, Volume 9838 of Lecture Notes in Computer Science, pp. 222–233. Springer). In contrast to Maciuca et al., however, we use only one additional symbol. 
This symbol marks variant sites in a chromosome and delimits multiple variants, which are added at the end of the ‘marked chromosome’. We show how the backward search algorithm, which is used in BWT-based read mappers, can be modified in such a way that it can cope with the genetic variation encoded in the BWT. We implemented our method and compared it with BWBBLE and gramtools. Availability and implementation https://www.uni-ulm.de/in/theo/research/seqana/. Contact [email protected]
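The idea of encoding SNPs with the IUPAC nucleotide code can be illustrated with a small sketch. The IUPAC table itself is standard; everything else here is an assumption made for illustration: the helper names `encode_snps` and `matches_at` are hypothetical, SNPs are passed as a position-to-alternatives mapping, and matching is done by a naive scan rather than through a BWT as in the methods the abstract refers to.

```python
# Standard IUPAC nucleotide codes and the bases each code stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Reverse lookup: set of bases -> IUPAC code.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def encode_snps(reference, snps):
    """Replace each SNP position with the IUPAC code covering all its alleles.

    snps maps a 0-based position to the alternative bases seen at that position.
    """
    out = list(reference)
    for pos, alts in snps.items():
        alleles = frozenset(alts) | {reference[pos]}
        out[pos] = CODE_FOR[alleles]
    return "".join(out)


def matches_at(encoded, read, start):
    """True if read is compatible with the encoded sequence at offset start."""
    if start < 0 or start + len(read) > len(encoded):
        return False
    window = encoded[start:start + len(read)]
    return all(base in IUPAC[code] for base, code in zip(read, window))


encoded = encode_snps("ACGTAC", {2: "A", 4: "CT"})
print(encoded)                         # ACRTHC
print(matches_at(encoded, "CATT", 1))  # True
```

Because one code stands for a set of bases at a single position, this representation suits SNPs but cannot express variants that change sequence length, which is consistent with the abstract handling other kinds of variation differently.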



2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Xin Shao ◽  
Ning Lv ◽  
Jie Liao ◽  
Jinbo Long ◽  
Rui Xue ◽  
...  

Abstract Background Cancer is a heterogeneous disease with many genetic variations. Lines of evidence have shown that copy number variations (CNVs) of certain genes are involved in the development and progression of many cancers through alterations of their gene expression levels in individual or several cancer types. However, it is not clear whether this correlation is a general phenomenon across multiple cancer types. Methods In this study we applied a bioinformatics approach integrating CNV and differential gene expression mathematically across 1025 cell lines and 9159 patient samples to detect their potential relationship. Results Our results showed that there is a close correlation between CNV and differential gene expression and that copy number displayed a positive linear influence on gene expression for the majority of genes, indicating that genetic variation has a direct effect on the gene transcriptional level. Another independent dataset was used to revalidate the relationship between copy number and expression level. Further analysis shows that genes with a general positive linear influence on gene expression are clustered in certain disease-related pathways, which suggests the involvement of CNV in the pathophysiology of diseases. Conclusions This study shows the close correlation between CNV and differential gene expression, revealing the qualitative relationship between genetic variation and its downstream effect, especially for oncogenes and tumor suppressor genes. It is of critical importance to elucidate the relationship between copy number variation and gene expression for the prevention, diagnosis and treatment of cancer.



2021 ◽  
Vol 12 ◽  
Author(s):  
Manuela Moraru ◽  
Adriana Perez-Portilla ◽  
Karima Al-Akioui Sanz ◽  
Alfonso Blazquez-Moreno ◽  
Antonio Arnaiz-Villena ◽  
...  

Fcγ receptors (FcγR), cell-surface glycoproteins that bind antigen-IgG complexes, control both humoral and cellular immune responses. The FCGR locus on chromosome 1q23.3 comprises five homologous genes encoding low-affinity FcγRII and FcγRIII, and displays functionally relevant polymorphism that impacts on human health. Recurrent events of non-allelic homologous recombination across the FCGR locus result in copy-number variation of ~82.5 kbp-long fragments known as copy-number regions (CNR). Here, we characterize a recently described deletion that we name CNR5, which results in loss of FCGR3A, FCGR3B, and FCGR2C, and generation of a recombinant FCGR3B/A gene. We show that the CNR5 recombination spot lies at the beginning of the third FCGR3 intron. Although the FCGR3B/A-encoded hybrid protein CD16B/A reaches the plasma membrane in transfected cells, its possible natural expression, predictably restricted to neutrophils, could not be demonstrated in resting or interferon γ-stimulated cells. As the CNR5-deletion was originally described in an Ecuadorian family from Llano Grande (an indigenous community in North-Eastern Quito), we characterized the FCGR genetic variation in two populations from the highlands of Ecuador. Our results reveal that CNR5-deletion is relatively frequent in Llano Grande (5 carriers out of 36 donors). Furthermore, we found a high frequency of two strong-phagocytosis variants: the FCGR3B-NA1 haplotype and the CNR1 duplication, which translates into an increased FCGR3B and FCGR2C copy-number. CNR1 duplication was particularly increased in Llano Grande, 77.8% of the studied sample carrying at least one such duplication. In contrast, an extended haplotype CD16A-176V – CD32C-ORF+2B.2 – CD32B-2B.4 including strong activating and inhibitory FcγR variants was absent in Llano Grande and found at a low frequency (8.6%) in Ecuador highlands. 
This particular distribution of FCGR polymorphism, possibly a result of selective pressures, further confirms the importance of a comprehensive, joint analysis of all genetic variations in the locus and warrants additional studies on their putative clinical impact. In conclusion, our study confirms important ethnic variation at the FCGR locus; it shows a distinctive FCGR polymorphism distribution in Ecuador highlands; provides a molecular characterization of a novel CNR5-deletion associated with CD16A and CD16B deficiency; and confirms its presence in that population.



Blood ◽  
2009 ◽  
Vol 113 (19) ◽  
pp. 4512-4520 ◽  
Author(s):  
Deborah French ◽  
Wenjian Yang ◽  
Cheng Cheng ◽  
Susana C. Raimondi ◽  
Charles G. Mullighan ◽  
...  

Abstract Methotrexate polyglutamates (MTXPGs) determine in vivo efficacy in acute lymphoblastic leukemia (ALL). MTXPG accumulation differs by leukemic subtypes, but genomic determinants of MTXPG variation in ALL remain unclear. We analyzed 3 types of whole genome variation: leukemia cell gene expression and somatic copy number variation, and inherited single nucleotide polymorphism (SNP) genotypes and determined their association with MTXPGs in leukemia cells. Seven genes (FHOD3, IMPA2, ME2, RASSF4, SLC39A6, SMAD2, and SMAD4) displayed all 3 types of genomic variation associated with MTXPGs (P < .05 for gene expression, P < .01 for copy number variation and SNPs): 6 on chromosome 18 and 1 on chromosome 10. Increased chromosome 18 (P = .002) or 10 (P = .036) copy number was associated with MTXPGs even after adjusting for ALL subtype. The expression of the top 7 genes in leukemia cells accounted for more variation in MTXPGs (46%) than did the expression of the top 7 genes in normal HapMap cell lines (20%). The top 7 inherited SNPs in patients accounted for approximately the same degree of variation (17%) in MTXPGs as did the top 7 SNP genotypes in HapMap cell lines (20%). We conclude that acquired genetic variation in leukemia cells has a stronger influence on MTXPG accumulation than inherited genetic variation.



2013 ◽  
Vol 20 (3) ◽  
pp. 224-236 ◽  
Author(s):  
Zhanyong Wang ◽  
Farhad Hormozdiari ◽  
Wen-Yun Yang ◽  
Eran Halperin ◽  
Eleazar Eskin


2016 ◽  
Author(s):  
Sorina Maciuca ◽  
Carlos del Ojo Elias ◽  
Gil McVean ◽  
Zamin Iqbal

Abstract We show how positional markers can be used to encode genetic variation within a Burrows-Wheeler Transform (BWT), and use this to construct a generalisation of the traditional “reference genome”, incorporating known variation within a species. Our goal is to support the inference of the closest mosaic of previously known sequences to the genome(s) under analysis. Our scheme results in an increased alphabet size, and by using a wavelet tree encoding of the BWT we reduce the performance impact on rank operations. We give a specialised form of the backward search that allows variation-aware exact matching. We implement this, and demonstrate the cost of constructing an index of the whole human genome with 8 million genetic variants is 25 GB of RAM. We also show that inferring a closer reference can close large kilobase-scale coverage gaps in P. falciparum.
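To make the link between positional markers and alphabet size concrete, here is a rough sketch of how a reference with variant sites might be flattened into one symbol sequence before a BWT is built. The marker scheme is an assumption, since the abstract does not spell it out. In this sketch each site gets a fresh pair of integers, one bracketing the site and one separating its alleles, and numbering starts at 5 on the assumption that A, C, G and T occupy 1 to 4. The function name `linearise` and the input layout are hypothetical.

```python
def linearise(segments, first_marker=5):
    """Flatten invariant segments and variant sites into one symbol list.

    segments is a list whose items are either a str (invariant sequence)
    or a list of str (the alleles of one variant site).
    """
    out = []
    marker = first_marker
    for seg in segments:
        if isinstance(seg, str):
            out.extend(seg)  # invariant bases are copied unchanged
            continue
        boundary, separator = marker, marker + 1
        out.append(boundary)  # opens the site
        for k, allele in enumerate(seg):
            if k:
                out.append(separator)  # sits between consecutive alleles
            out.extend(allele)
        out.append(boundary)  # closes the site
        marker += 2  # the next site gets a new pair of symbols
    return out


print(linearise(["AC", ["G", "TT"], "A"]))
# ['A', 'C', 5, 'G', 6, 'T', 'T', 5, 'A']
```

Under this assumed scheme every variant site adds two symbols to the alphabet, so an index over millions of variants has an alphabet of millions of symbols, which is why the cost of rank operations becomes a concern.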



2010 ◽  
Vol 92 (2) ◽  
pp. 115-125 ◽  
Author(s):  
M. P. L. CALUS ◽  
D. J. DE KONING ◽  
C. S. HALEY

Summary The objective of this study was to investigate, both empirically and deterministically, the ability to explain genetic variation resulting from a copy number polymorphism (CNP) by including the CNP, either by its genotype or by a continuous derivation thereof, alone or together with a nearby single nucleotide polymorphism (SNP) in the model. This continuous measure of a CNP genotype could be a raw hybridization measurement, or a predicted CNP genotype. Results from simulations showed that the linkage disequilibrium (LD) between an SNP and CNP was lower than LD between two SNPs, due to the higher mutation rate at the CNP loci. The model R² values from analysing the simulated data were very similar to the R² values predicted with the deterministic formulae. Under the assumption that x copies at a CNP locus lead to the effect of x times the effect of 1 copy, including a continuous measure of a CNP locus in the model together with the genotype of a nearby SNP increased power to explain variation at the CNP locus, even when the continuous measure explained only 15% of the variation at the CNP locus.



2020 ◽  
Vol 21 (9) ◽  
pp. 3296 ◽  
Author(s):  
Syed K. Rafi ◽  
Merlin G. Butler

The 15q11.2 BP1-BP2 microdeletion (Burnside–Butler) syndrome is emerging as the most frequent pathogenic copy number variation (CNV) in humans associated with neurodevelopmental disorders with changes in brain morphology, behavior, and cognition. In this study, we explored functions and interactions of the four protein-coding genes in this region, namely NIPA1, NIPA2, CYFIP1, and TUBGCP5, and elucidated their role, in solo and in concert, in the causation of neurodevelopmental disorders. First, we investigated the STRING protein-protein interactions encompassing all four genes and ascertained their predicted Gene Ontology (GO) functions, such as biological processes involved in their interactions, pathways and molecular functions. These include magnesium ion transport molecular function, regulation of axonogenesis and axon extension, regulation and production of bone morphogenetic protein and regulation of cellular growth and development. We gathered a list of significantly associated cardinal maladies for each gene from searchable genomic disease websites, namely MalaCards.org: HGMD, OMIM, ClinVar, GTR, Orphanet, DISEASES, Novoseek, and GeneCards.org. Through tabulations of such disease data, we ascertained the cardinal disease association of each gene, as well as their expanded putative disease associations. This enabled further tabulation of disease data to ascertain the role of each gene in the top ten overlapping significant neurodevelopmental disorders among the disease association data sets: (1) Prader–Willi Syndrome (PWS); (2) Angelman Syndrome (AS); (3) 15q11.2 Deletion Syndrome with Attention Deficit Hyperactive Disorder & Learning Disability; (4) Autism Spectrum Disorder (ASD); (5) Schizophrenia; (6) Epilepsy; (7) Down Syndrome; (8) Microcephaly; (9) Developmental Disorder, and (10) Peripheral Nervous System Disease. 
The cardinal disease associations for each of the four contiguous 15q11.2 BP1-BP2 genes are NIPA1- Spastic Paraplegia 6; NIPA2—Angelman Syndrome and Prader–Willi Syndrome; CYFIP1—Fragile X Syndrome and Autism; TUBGCP5—Prader–Willi Syndrome. The four genes are individually associated with PWS, ASD, schizophrenia, epilepsy, and Down syndrome. Except for TUBGCP5, the other three genes are associated with AS. Unlike the other genes, TUBGCP5 is also not associated with attention deficit hyperactivity disorder and learning disability, developmental disorder, or peripheral nervous system disease. CYFIP1 was the only gene not associated with microcephaly but was the only gene associated with developmental disorders. Collectively, all four genes were associated with up to three-fourths of the ten overlapping neurodevelopmental disorders and are deleted in this most prevalent known pathogenic copy number variation now recognized among humans with these clinical findings.



Author(s):  
Zhanyong Wang ◽  
Farhad Hormozdiari ◽  
Wen-Yun Yang ◽  
Eran Halperin ◽  
Eleazar Eskin

