genome graph
Recently Published Documents



2021 ◽  
Author(s):  
Cristian Groza ◽  
Xun Chen ◽  
Alain Pacis ◽  
Marie-Michelle Simon ◽  
Albena Pramatarova ◽  
...  

Background Epigenomic experiments can be used to survey the chromatin state of the human genome and find functionally relevant sequences in given cells. However, the reference genome that is typically used to interpret these data does not account for SNPs, indels, and other structural variants present in the individual being profiled. Fortunately, population studies and whole genome sequencing can assemble tens of thousands of sequences that are not in the reference, including mobile element insertions (MEIs), which are known to influence the epigenome. We hypothesized that the use of a genome graph, which can capture this genetic diversity, could help identify more peaks and reveal notable regulatory sequences hidden by the use of a biased reference.
Results Given the contributions of MEIs to the evolution of human innate immunity, we wanted to test this hypothesis in macrophages derived from 35 individuals of African and European ancestry before and after in vitro influenza infection. We used local assembly to resolve non-reference MEIs based on linked reads obtained from these individuals and reconstructed over five thousand Alu, over three hundred L1, and tens of SVA and ERV insertions. Next, we built a genome graph representing SNPs, indels and MEIs in these genomes and demonstrated improved read mapping sensitivity and specificity. Aligning H3K27ac and H3K4me1 ChIP-seq and ATAC-seq data on this genome graph revealed between 2 and 6 thousand novel peaks per sample. Notably, we observed hundreds of polymorphic MEIs that were marked by active histone modifications or accessible chromatin, of which 12 were associated with differential gene expression. Lastly, we found an MEI polymorphism in an active epigenomic state that is associated with the expression of TRIM25, a gene that restricts influenza RNA synthesis.
Conclusion Our results demonstrate that the use of graph genomes capturing genetic variability can reveal notable regulatory regions that would have been missed by standard analytical approaches.


Sensors ◽  
2021 ◽  
Vol 21 (19) ◽  
pp. 6467
Author(s):  
Haron Tinega ◽  
Enqing Chen ◽  
Long Ma ◽  
Richard M. Mariita ◽  
Divinah Nyasaka

Recently developed hybrid models that stack 3D and 2D CNNs in their structure have enjoyed high popularity due to their appealing performance in hyperspectral image classification tasks. On the other hand, biological genome graphs have demonstrated their effectiveness in enhancing the scalability and accuracy of genomic analysis. We propose an innovative deep genome graph-based network (GGBN) for hyperspectral image classification to tap the potential of hybrid models and genome graphs. The GGBN model utilizes 3D-CNNs at the bottom layers and 2D-CNNs at the top layers to process spectral–spatial features vital to enhancing the scalability and accuracy of hyperspectral image classification. To verify the effectiveness of the GGBN model, we conducted classification experiments on the Indian Pines (IP), University of Pavia (UP), and Salinas Scene (SA) datasets. Using only 5% of the labeled data for training over the SA, IP, and UP datasets, the classification accuracy of GGBN is 99.97%, 96.85%, and 99.74%, respectively, which is better than the compared state-of-the-art methods.


2021 ◽  
Author(s):  
Martin Hunt ◽  
Brice Letcher ◽  
Kerri M Malone ◽  
Giang Nguyen ◽  
Michael B Hall ◽  
...  

Short-read variant calling for bacterial genomics is a mature field, and there are many widely used software tools. Different underlying approaches (e.g. pileup, local or global assembly, paired-read use, haplotype use) lend each tool different strengths, especially when considering non-SNP (single nucleotide polymorphism) variation or potentially distant reference genomes. It would therefore be valuable to be able to integrate the results from multiple variant callers, using a robust statistical approach to "adjudicate" at loci where there is disagreement between callers. To this end, we present a tool, Minos, for variant adjudication by mapping reads to a genome graph of variant calls. Minos allows users to combine output from multiple variant callers without loss of precision. Minos also addresses a second problem of joint genotyping SNPs and indels in bacterial cohorts, which can also be framed as an adjudication problem. We benchmark on 62 samples from 3 species (Mycobacterium tuberculosis, Staphylococcus aureus, Klebsiella pneumoniae) and an outbreak of 385 M. tuberculosis samples. Finally, we joint genotype a large M. tuberculosis cohort (N≈15k) for which the rifampicin phenotype is known. We build a map of non-synonymous variants in the RRDR (rifampicin resistance determining region) of the rpoB gene and extend current knowledge relating RRDR SNPs to heterogeneity in rifampicin resistance levels. We replicate this finding in a second M. tuberculosis cohort (N≈13k). Minos is released under the MIT license, available at https://github.com/iqbal-lab-org/minos.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Rachel M. Colquhoun ◽  
Michael B. Hall ◽  
Leandro Lima ◽  
Leah W. Roberts ◽  
Kerri M. Malone ◽  
...  

Abstract We present pandora, a novel pan-genome graph structure and algorithms for identifying variants across the full bacterial pan-genome. As much bacterial adaptability hinges on the accessory genome, methods which analyze SNPs in just the core genome have unsatisfactory limitations. Pandora approximates a sequenced genome as a recombinant of references, detects novel variation and pan-genotypes multiple samples. Using a reference graph of 578 Escherichia coli genomes, we compare 20 diverse isolates. Pandora recovers more rare SNPs than single-reference-based tools, is significantly better than picking the closest RefSeq reference, and provides a stable framework for analyzing diverse samples without reference bias.


F1000Research ◽  
2021 ◽  
Vol 8 ◽  
pp. 1751
Author(s):  
Bastien Llamas ◽  
Giuseppe Narzisi ◽  
Valerie Schneider ◽  
Peter A. Audano ◽  
Evan Biederstedt ◽  
...  

In March 2019, 45 scientists and software engineers from around the world converged at the University of California, Santa Cruz for the first pangenomics codeathon. The purpose of the meeting was to propose technical specifications and standards for a usable human pangenome as well as to build relevant tools for genome graph infrastructures. During the meeting, the group held several intense and productive discussions covering a diverse set of topics, including advantages of graph genomes over a linear reference representation, design of new methods that can leverage graph-based data structures, and novel visualization and annotation approaches for pangenomes. Additionally, the participants self-organized into teams that worked intensely over a three-day period to build a set of pipelines and tools for specific pangenomic applications. A summary of the questions raised and the tools developed is reported in this manuscript.


2021 ◽  
Author(s):  
Yutong Qiu ◽  
Carl Kingsford

Abstract The size of a genome graph (the space required to store the nodes, their labels and edges) affects the efficiency of operations performed on it. For example, the time complexity to align a sequence to a graph without a graph index depends on the total number of characters in the node labels and the number of edges in the graph. The size of the graph also affects the size of the graph index that is used to speed up the alignment. This raises the need for approaches to construct space-efficient genome graphs. We point out similarities in the string encoding approaches of genome graphs and the external pointer macro (EPM) compression model. Supported by these similarities, we present a pair of linear-time algorithms that transform between genome graphs and EPM-compressed forms. We show that the algorithms result in an upper bound on the size of the genome graph constructed based on an optimal EPM compression. In addition to the transformation, we show that equivalent choices made by EPM compression algorithms may result in different sizes of genome graphs. To further optimize the size of the genome graph, we propose the source assignment problem that optimizes over the equivalent choices during compression and introduce an ILP formulation that solves that problem optimally. As a proof-of-concept, we introduce RLZ-Graph, a genome graph constructed based on the relative Lempel-Ziv EPM compression algorithm. We show that using RLZ-Graph, across all human chromosomes, we are able to reduce the disk space to store a genome graph on average by 40.7% compared to colored de Bruijn graphs constructed by Bifrost under the default settings. The RLZ-Graph software is available at https://github.com/Kingsford-Group/rlzgraph
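
The relative Lempel-Ziv step named in the abstract above can be illustrated with a small sketch. It shows only a naive greedy factorization of a target string against a reference and its decoding. It is not the RLZ-Graph construction, the source assignment ILP, or a linear-time algorithm, and the function names and phrase tuples are assumptions made for illustration.

```python
def longest_match(reference: str, target: str, start: int):
    """Longest prefix of target[start:] that occurs in reference (naive search)."""
    best_pos, best_len = -1, 0
    length = 1
    while start + length <= len(target):
        pos = reference.find(target[start:start + length])
        if pos == -1:
            break  # longer prefixes cannot occur either
        best_pos, best_len = pos, length
        length += 1
    return best_pos, best_len


def rlz_factorize(reference: str, target: str):
    """Greedy parse of target into copies from reference, plus literals."""
    phrases = []
    i = 0
    while i < len(target):
        pos, length = longest_match(reference, target, i)
        if length == 0:
            phrases.append(("literal", target[i]))  # character absent from reference
            i += 1
        else:
            phrases.append(("copy", pos, length))   # reference[pos:pos+length]
            i += length
    return phrases


def rlz_decode(reference: str, phrases):
    """Rebuild the target string from its phrases."""
    parts = []
    for phrase in phrases:
        if phrase[0] == "literal":
            parts.append(phrase[1])
        else:
            _, pos, length = phrase
            parts.append(reference[pos:pos + length])
    return "".join(parts)


reference = "ACGTACGT"
target = "ACGTTACGN"
phrases = rlz_factorize(reference, target)
print(phrases)  # [('copy', 0, 4), ('copy', 3, 4), ('literal', 'N')]
```

In this toy encoding each copy phrase points at a reference interval, which is the kind of pointer that the EPM model describes; how such phrase boundaries become graph nodes is left to the paper.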


2021 ◽  
Author(s):  
Brice Letcher ◽  
Martin Hunt ◽  
Zamin Iqbal

Abstract
Background Standard approaches to characterising genetic variation revolve around mapping reads to a reference genome and describing variants in terms of differences from the reference; this is based on the assumption that these differences will be small and provides a simple coordinate system. However, this fails, and the coordinates break down, when there are diverged haplotypes at a locus (e.g. one haplotype contains a multi-kilobase deletion, a second contains a few SNPs, and a third is highly diverged with hundreds of SNPs). To handle these, we need to model genetic variation that occurs at different length-scales (SNPs to large structural variants) and that occurs on alternate backgrounds. We refer to these together as multiscale variation.
Results We model the genome as a directed acyclic graph consisting of successive hierarchical subgraphs (“sites”) that naturally incorporate multiscale variation, and introduce an algorithm for genotyping, implemented in the software gramtools. This enables variant calling on different sequence backgrounds. In addition to producing regular VCF files, we introduce a JSON file format based on VCF, which records variant site relationships and alternate sequence backgrounds. We show two applications. First, we benchmark gramtools against existing state-of-the-art methods in joint-genotyping 17 M. tuberculosis samples at long deletions and the overlapping small variants that segregate in a cohort of 1,017 genomes. Second, in 706 African and SE Asian P. falciparum genomes, we analyse a dimorphic surface antigen gene which possesses variation on two diverged backgrounds which appeared to not recombine. This generates the first map of variation on both backgrounds, revealing patterns of recombination that were previously unknown.
Conclusions We need new approaches to be able to jointly analyse SNP and structural variation in cohorts, and even more to handle variants on different genetic backgrounds. We have demonstrated that by modelling with a directed, acyclic and locally hierarchical genome graph, we can apply new algorithms to accurately genotype dense variation at multiple scales. We also propose a generalisation of VCF for accessing multiscale variation in genome graphs, which we hope will be of wide utility.
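
To make the notion of nested variant sites in the abstract above concrete, here is a toy sketch. It assumes a hypothetical in-memory representation in which a sequence is a list of plain strings and `Site` objects, and each allele of a site is itself such a list. The `Site` class and `haplotypes` function are illustrative names only and do not reflect gramtools' data structures or its JSON format.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Site:
    """A variant site: a list of alleles, each a list of str or nested Site."""
    alleles: list


def haplotypes(segments):
    """Enumerate every sequence spelled by a list of strings and (nested) sites."""
    options = []
    for segment in segments:
        if isinstance(segment, str):
            options.append([segment])
        else:
            # A site contributes one choice per haplotype of each of its alleles.
            options.append([h for allele in segment.alleles for h in haplotypes(allele)])
    return ["".join(choice) for choice in product(*options)]


# One site with three alleles: a SNP-like allele "G", a longer allele that
# carries its own nested SNP site (A/C), and an empty allele (a deletion).
graph = [
    "AC",
    Site([["G"], ["T", Site([["A"], ["C"]]), "T"], [""]]),
    "GT",
]
print(haplotypes(graph))  # ['ACGGT', 'ACTATGT', 'ACTCTGT', 'ACGT']
```

The nested site only exists on the background of the second allele, which is the situation a flat, reference-relative coordinate system struggles to describe.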


2020 ◽  
Author(s):  
Daniel Danciu ◽  
Mikhail Karasikov ◽  
Harun Mustafa ◽  
André Kahles ◽  
Gunnar Rätsch

Abstract Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently-used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. In this paper, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of nodes adjacent in the graph. RowDiff can be constructed in linear time relative to the number of nodes and labels in the graph, and the construction can be efficiently parallelized and distributed, significantly reducing construction time. RowDiff can be viewed as an intermediary sparsification step of the initial annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrix representation. Our experiments on the Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a Multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST, the previously known smallest annotation representation. In addition, experiments on 10,000 RNA-seq datasets show that RowDiff combined with Multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST.
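
The idea of exploiting similar annotations on adjacent nodes, as described in the abstract above, can be sketched in a few lines. This toy version assumes that every node has one designated successor, that every successor chain ends in an anchor node (successor `None`) whose labels are stored in full, and that all other rows are stored as a symmetric difference against the successor's row. The names `rowdiff_encode` and `rowdiff_decode` and the dict-of-sets layout are hypothetical, not the tool's actual representation.

```python
def rowdiff_encode(rows, successor):
    """Store each row as its difference to the successor's row; anchors stay full.

    rows: dict mapping node -> set of labels
    successor: dict mapping node -> successor node, or None for an anchor
    """
    encoded = {}
    for node, labels in rows.items():
        succ = successor[node]
        if succ is None:
            encoded[node] = set(labels)               # anchor: full row
        else:
            encoded[node] = set(labels) ^ rows[succ]  # sparse difference
    return encoded


def rowdiff_decode(encoded, successor, node):
    """Recover a node's labels by XOR-ing the diffs along its successor chain."""
    labels = set()
    while node is not None:
        labels ^= encoded[node]
        node = successor[node]
    return labels


# Three nodes on one path a -> b -> c, with c as the anchor.
rows = {
    "a": {"sample1", "sample2"},
    "b": {"sample1", "sample2"},
    "c": {"sample1", "sample2", "sample3"},
}
successor = {"a": "b", "b": "c", "c": None}

encoded = rowdiff_encode(rows, successor)
stored = sum(len(diff) for diff in encoded.values())
original = sum(len(labels) for labels in rows.values())
print(stored, original)  # 4 7
```

Decoding cost grows with the distance to the nearest anchor, so any real scheme has to bound chain length; that trade-off is not modelled here.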


2020 ◽  
Vol 2 (3) ◽  
Author(s):  
Cervin Guyomar ◽  
Wesley Delage ◽  
Fabrice Legeai ◽  
Christophe Mougel ◽  
Jean-Christophe Simon ◽  
...  

Abstract Most metazoans are associated with symbionts. Characterizing the effect of a particular symbiont often requires getting access to its genome, which is usually done by sequencing the whole community. We present MinYS, a targeted assembly approach to assemble a particular genome of interest from such metagenomic data. First, taking advantage of a reference genome, a subset of the reads is assembled into a set of backbone contigs. Then, this draft assembly is completed using the whole metagenomic readset in a de novo manner. The resulting assembly is output as a genome graph, enabling different strains with potential structural variants coexisting in the sample to be distinguished. MinYS was applied to 50 pea aphid resequencing samples, with variable diversity in symbiont communities, in order to recover the genome sequence of its obligatory bacterial symbiont, Buchnera aphidicola. It was able to return high-quality assemblies (one contig assembly in 90% of the samples), even when using increasingly distant reference genomes, and to retrieve large structural variations in the samples. Because of its targeted essence, it outperformed standard metagenomic assemblers in terms of both time and assembly quality.


2020 ◽  
Vol 10 (1) ◽  
pp. 82-96
Author(s):  
Kadir Dede ◽  
Enno Ohlebusch

Abstract Marcus et al. (Bioinformatics 2014) proposed to use a compressed de Bruijn graph as a description of a pan-genome, comprising the genomes of many individuals/strains of the same or closely related species. Subsequent work improved the construction of the compressed de Bruijn graph in terms of run-time and memory consumption. According to the Computational Pan-Genomics Consortium (Briefings in Bioinformatics 2016), a pan-genome data structure should support the following functionality: “All information within a data structure should be easily accessible for human eyes by visualization support on different scales.” However, a pan-genome graph can have thousands to millions of nodes and such an amount of information is certainly not easily accessible for human eyes. Thus, the possibility to construct pan-genome subgraphs on demand would be quite valuable. In this article, we use the space-efficient representation of the compressed de Bruijn graph devised by Beller and Ohlebusch (Algorithms for Molecular Biology 2016) to construct pan-genome subgraphs on the fly. The user can specify a region in one of the genomes and the software tool will build a subgraph that contains the path corresponding to that region and all paths that are in the neighborhood of that path. The size of the neighborhood can be controlled by the user.
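
The on-demand subgraph described in the abstract above can be approximated by a bounded breadth-first search around the chosen path. The sketch assumes a plain adjacency dict with symmetric (undirected) neighbor lists standing in for the compressed de Bruijn graph, and the function name and `radius` parameter are hypothetical. It does not use the space-efficient representation that the article builds on.

```python
from collections import deque


def neighborhood_subgraph(adjacency, path, radius):
    """Return the subgraph induced by all nodes within `radius` edges of `path`.

    adjacency: dict mapping node -> list of neighboring nodes (symmetric)
    path: list of nodes spelling the user-selected region
    radius: maximum number of edges allowed between a kept node and the path
    """
    distance = {node: 0 for node in path}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not expand beyond the requested neighborhood
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    kept = set(distance)
    return {
        node: [n for n in adjacency.get(node, ()) if n in kept]
        for node in kept
    }


adjacency = {
    1: [2], 2: [1, 3, 5], 3: [2, 4], 4: [3],
    5: [2, 6], 6: [5, 7], 7: [6],
}
subgraph = neighborhood_subgraph(adjacency, path=[1, 2, 3], radius=1)
print(sorted(subgraph))  # [1, 2, 3, 4, 5]
```

Raising `radius` to 2 would also pull in node 6, which mirrors the user-controlled neighborhood size mentioned in the abstract.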

