PARALLEL ALGORITHMS FOR MAPPING SHORT DEGENERATE AND WEIGHTED DNA SEQUENCES TO A REFERENCE GENOME

2012 ◽  
Vol 23 (02) ◽  
pp. 249-259
Author(s):  
COSTAS S. ILIOPOULOS ◽  
MIRKA MILLER ◽  
SOLON P. PISSIS

One of the most ambitious trends in current biomedical research is the large-scale genomic sequencing of patients. Novel high-throughput (or next-generation) sequencing technologies have redefined the way genome sequencing is performed. They are able to produce millions of short sequences (reads) in a single experiment, at a much lower cost than was previously possible. Because of this massive amount of data, efficient algorithms for mapping these sequences to a reference genome are in great demand, and many such algorithms have recently been published. One important feature of these algorithms is support for multithreaded parallel computing to speed up the mapping process. In this paper, we design parallel algorithms that use the message-passing parallelism model to address this problem efficiently. The proposed algorithms also take into consideration the probability score assigned to each base of occurring at a specific position of a sequence. In particular, we present parallel algorithms for mapping short degenerate and weighted DNA sequences to a reference genome.
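The core matching step for weighted sequences can be illustrated with a short sketch. This is not the paper's parallel algorithm; it is a minimal serial illustration, with invented function names and an illustrative probability threshold, of what it means for a weighted pattern (per-position base probabilities) to occur in a reference:

```python
# Sketch: matching a weighted DNA pattern against a reference.
# A weighted sequence assigns each position a probability for every base;
# here a match is declared when the product of the probabilities of the
# reference bases meets a threshold (an illustrative criterion, not the
# paper's exact formulation).

def match_probability(weights, window):
    """Product of per-position probabilities of the reference bases."""
    p = 1.0
    for probs, base in zip(weights, window):
        p *= probs.get(base, 0.0)
        if p == 0.0:
            return 0.0
    return p

def find_occurrences(weights, reference, threshold=0.5):
    """Positions where the weighted pattern occurs with probability >= threshold."""
    m = len(weights)
    hits = []
    for i in range(len(reference) - m + 1):
        if match_probability(weights, reference[i:i + m]) >= threshold:
            hits.append(i)
    return hits

# Example: a 3-base weighted pattern; position 1 is ambiguous (A or G).
pattern = [
    {"A": 1.0},
    {"A": 0.5, "G": 0.5},
    {"T": 1.0},
]
print(find_occurrences(pattern, "CAATAGTC", threshold=0.4))  # → [1, 4]
```

The degenerate case is the special instance where each position's allowed bases all carry equal, nonzero probability.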

2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Krisztian Buza ◽  
Bartek Wilczynski ◽  
Norbert Dojer

Background. Next-generation sequencing technologies now produce, in a single experiment, total reads amounting to multiple times the genome size. This is enough information to reconstruct at least some of the differences between the individual genome studied in the experiment and the reference genome of the species. In most typical protocols, however, this information is disregarded and the reference genome is used. Results. We provide a new approach that allows researchers to reconstruct genomes very closely related to the reference genome (e.g., mutants of the same species) directly from the reads used in the experiment. Our approach applies de novo assembly software to the experimental reads and so-called pseudoreads and uses the resulting contigs to generate a modified reference sequence. In this way, it can very quickly, and at no additional sequencing cost, generate a new, modified reference sequence that is closer to the actual sequenced genome and has full coverage. In this paper, we describe our approach and test its implementation, called RECORD. We evaluate RECORD on both simulated and real data, and we have made our software publicly available on SourceForge. Conclusion. Our tests show that on closely related sequences RECORD outperforms more general assisted-assembly software.
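The pseudoread idea can be sketched in a few lines: tile the reference genome with fixed-length fragments and mix them with the experimental reads before assembly. The read length and step size below are illustrative defaults, not RECORD's actual parameters:

```python
# Sketch: generating "pseudoreads" by tiling the reference genome with
# fixed-length fragments at a chosen step. Feeding these together with the
# experimental reads to a de novo assembler is the core idea described
# above; read_len and step are illustrative, not RECORD's defaults.

def make_pseudoreads(reference, read_len=100, step=50):
    reads = []
    for start in range(0, max(1, len(reference) - read_len + 1), step):
        reads.append(reference[start:start + read_len])
    return reads

reference = "ACGT" * 100          # toy 400 bp reference
pseudo = make_pseudoreads(reference, read_len=100, step=50)
print(len(pseudo), len(pseudo[0]))  # → 7 100
```

With step < read_len the pseudoreads overlap, so the reference contributes continuous coverage wherever the experimental reads leave gaps.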


2016 ◽  
Author(s):  
Shaun D Jackman ◽  
Benjamin P Vandervalk ◽  
Hamid Mohamadi ◽  
Justin Chu ◽  
Sarah Yeo ◽  
...  

Abstract The de novo assembly of DNA sequences is fundamental to genomics research. It is the first of many steps towards elucidating and characterizing whole genomes, and downstream applications, including the analysis of genomic variation between species and between or within individuals, depend critically on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically; coupled with established and planned large-scale personalized-medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality draft reference genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome from short 50 bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). Here we present its re-design, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We present assembly benchmarks of human Genome in a Bottle 250 bp Illumina paired-end and 6 kbp mate-pair libraries from a single individual, yielding an NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using less than 35 GB of RAM, a modest memory requirement by today's standards that is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics Chromium data to further improve the scaffold contiguity of this assembly, to 42 (15) Mbp.
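The Bloom-filter representation of a de Bruijn graph can be sketched as follows. This is a minimal illustration of the idea, not ABySS's implementation; the filter size, hash construction, and traversal are all simplified assumptions. The key point is that edges need not be stored at all: they are recovered by querying the four possible successors of a k-mer.

```python
import hashlib

# Sketch: a de Bruijn graph held implicitly in a Bloom filter. k-mers are
# inserted into a bit array via multiple hashes; graph edges are recovered
# by probing the four possible successor k-mers. A Bloom filter can return
# false positives but never false negatives, so traversal may follow a few
# spurious branches but never misses a true one.

class BloomFilter:
    def __init__(self, size=1 << 20, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def successors(bloom, kmer):
    """Possible next k-mers in the implicit de Bruijn graph."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in bloom]

# Insert all k-mers of a toy sequence, then recover one edge.
k, seq = 5, "ACGTACGTAC"
bloom = BloomFilter()
for i in range(len(seq) - k + 1):
    bloom.add(seq[i:i + k])
print(successors(bloom, "ACGTA"))
```

The memory saving is the point: the bit array is a small constant number of bits per k-mer, regardless of k, instead of an explicit hash table of k-mer strings.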


2020 ◽  
Author(s):  
Yang Young Lu ◽  
Jiaxing Bai ◽  
Yiwen Wang ◽  
Ying Wang ◽  
Fengzhu Sun

Abstract Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice the k-mer frequency vectors for the large values of k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine that learns compact representations of sequences and performs fast comparison between DNA sequences. Specifically, given a genome or high-throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best-matching genome in archived massive sequence repositories. With a 10^2- to 10^4-fold reduction in storage space, CRAFT performs fast queries on gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/. Contact: [email protected]; [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.
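The baseline that CRAFT compresses can be illustrated directly: k-mer frequency vectors compared by cosine similarity. This sketch only shows the uncompressed representation CRAFT starts from; its learned low-dimensional embedding is not reproduced here. Note why the memory problem arises: for DNA the vector has 4^k entries, so k = 12 already implies roughly 17 million dimensions per sequence.

```python
from collections import Counter
from math import sqrt

# Sketch: the alignment-free baseline discussed above. Each sequence becomes
# a sparse k-mer frequency vector (4^k possible entries for DNA), compared
# here by cosine similarity.

def kmer_counts(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine(a, b):
    dot = sum(a[kmer] * b[kmer] for kmer in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * \
           sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

s1, s2 = "ACGTACGTACGT", "ACGTACGTTGCA"
sim = cosine(kmer_counts(s1, 4), kmer_counts(s2, 4))
print(round(sim, 3))
```

An embedding approach like CRAFT's replaces these high-dimensional vectors with short dense vectors whose distances approximate those of the originals, which is what enables searching gigabyte-scale repositories in seconds.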


2015 ◽  
Vol 370 (1660) ◽  
pp. 20130374 ◽  
Author(s):  
Federico Sánchez-Quinto ◽  
Carles Lalueza-Fox

Nearly two decades since the first retrieval of Neanderthal DNA, recent advances in next-generation sequencing technologies have allowed the generation of high-coverage genomes from two archaic hominins, a Neanderthal and a Denisovan, as well as a complete mitochondrial genome from remains that probably represent early members of the Neanderthal lineage. This genomic information, coupled with exome diversity data from several Neanderthal specimens, is shedding new light on evolutionary processes such as the genetic basis of Neanderthal- and modern-human-specific adaptations, including morphological and behavioural traits, as well as the extent and nature of the admixture events between them. An emerging picture is that Neanderthals had a long-term small population size, lived in small and isolated groups, and probably practised inbreeding at times. Deleterious genetic effects associated with these demographic factors could have played a role in their extinction. The analysis of DNA from further remains, making use of new large-scale hybridization-capture-based methods as well as new approaches to discriminating contaminant DNA sequences, will provide genetic information at spatial and temporal scales that could help clarify the Neanderthals'—and our very own—evolutionary history.


BMC Biology ◽  
2019 ◽  
Vol 17 (1) ◽  
Author(s):  
Amrita Srivathsan ◽  
Emily Hartop ◽  
Jayanthi Puniamoorthy ◽  
Wan Ting Lee ◽  
Sujatha Narayanan Kutty ◽  
...  

Abstract Background More than 80% of all animal species remain unknown to science. Most of these species live in the tropics and belong to animal taxa that combine small body size with high specimen abundance and large species richness. For such clades, using morphology for species discovery is slow because large numbers of specimens must be sorted based on detailed microscopic investigations. Fortunately, species discovery could be greatly accelerated if DNA sequences could be used for sorting specimens to species. Morphological verification of such "molecular operational taxonomic units" (mOTUs) could then be based on dissection of a small subset of specimens. However, this approach requires cost-effective and low-tech DNA barcoding techniques, because well-equipped, well-funded molecular laboratories are not readily available in many biodiverse countries. Results Here we document how MinION sequencing can be used for large-scale species discovery in a specimen- and species-rich taxon like the hyperdiverse fly family Phoridae (Diptera). We sequenced 7059 specimens collected in a single Malaise trap in Kibale National Park, Uganda, over the short period of 8 weeks. We discovered >650 species, which exceeds the number of phorid species currently described for the entire Afrotropical region. The barcodes were obtained using an improved low-cost MinION pipeline that increased the barcoding capacity sevenfold, from 500 to 3500 barcodes per flowcell. This was achieved by adopting 1D sequencing, resequencing weak amplicons on a used flowcell, and improving demultiplexing. Comparison with Illumina data revealed that the MinION barcodes were very accurate (99.99% accuracy, 0.46% Ns) and thus yielded very similar species units (match ratio 0.991). Morphological examination of 100 mOTUs also confirmed good congruence with morphology (93% of mOTUs; >99% of specimens) and revealed that 90% of the putative species belong to the neglected, megadiverse genus Megaselia.
We demonstrate for one Megaselia species how the molecular data can guide the description of a new species (Megaselia sepsioides sp. nov.). Conclusions We document that one field site in Africa can be home to an estimated 1000 species of phorids and speculate that the Afrotropical diversity could exceed 200,000 species. We furthermore conclude that low-cost MinION sequencers are very suitable for reliable, rapid, and large-scale species discovery in hyperdiverse taxa. MinION sequencing could quickly reveal the extent of the unknown diversity and is especially suitable for biodiverse countries with limited access to capital-intensive sequencing facilities.
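The demultiplexing step, assigning each read to its specimen by a barcode tag, can be sketched as follows. The tag length, barcodes, and one-mismatch budget below are invented for illustration; the improved demultiplexing in the pipeline above is more involved (e.g., it must cope with indels and varying tag positions in noisy MinION reads):

```python
# Sketch: assigning a read to a specimen by the barcode tag at its 5' end,
# tolerating a small number of substitutions. Barcodes, tag length, and
# the mismatch budget are illustrative only.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def demultiplex(read, barcodes, max_mismatches=1):
    """Return the specimen whose barcode best matches the read prefix, or None."""
    best, best_dist = None, max_mismatches + 1
    for specimen, tag in barcodes.items():
        dist = hamming(read[:len(tag)], tag)
        if dist < best_dist:
            best, best_dist = specimen, dist
    return best

barcodes = {"specimen_1": "AACCGGTT", "specimen_2": "TTGGCCAA"}
print(demultiplex("AACCGGTAACGTTT", barcodes))  # → specimen_1
```

Barcode sets used in practice are designed with large pairwise distances, so a small mismatch budget cannot flip a read from one specimen to another.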


Author(s):  
Adrien Oliva ◽  
Raymond Tobler ◽  
Alan Cooper ◽  
Bastien Llamas ◽  
Yassine Souilmi

Abstract The current standard practice for assembling individual genomes involves mapping millions of short DNA sequences (also known as DNA 'reads') against a pre-constructed reference genome. Mapping vast amounts of short reads in a timely manner is a computationally challenging task that inevitably produces artefacts, including biases against alleles not found in the reference genome. This reference bias and other mapping artefacts are expected to be exacerbated in ancient DNA (aDNA) studies, which rely on the analysis of low quantities of damaged and very short DNA fragments (~30–80 bp). Nevertheless, the current gold-standard mapping strategies for aDNA studies have remained effectively unchanged for nearly a decade, during which time new software has emerged. In this study, we used simulated aDNA reads from three different human populations to benchmark the performance of 30 distinct mapping strategies implemented across four different read-mapping software packages (BWA-aln, BWA-mem, NovoAlign and Bowtie2) and quantified the impact of reference bias in downstream population genetic analyses. We show that specific NovoAlign, BWA-aln and BWA-mem parameterizations achieve high mapping precision with low levels of reference bias, particularly after filtering out reads with low mapping qualities. However, unbiased NovoAlign results required the use of an IUPAC reference genome. While relevant only to aDNA projects where reference population data are available, the benefit of using an IUPAC reference demonstrates the value of incorporating population genetic information into the aDNA mapping process, echoing recent results based on graph genome representations.
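One way to build an IUPAC reference of the kind used above is to replace each known biallelic SNP site with the ambiguity code covering both alleles, so a read carrying either allele matches the reference equally well. The SNP positions and alleles below are invented for illustration:

```python
# Sketch: injecting known population SNPs into a reference as IUPAC
# ambiguity codes, one way to mitigate the reference bias discussed above.
# Positions and alleles are invented for illustration.

IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("GC"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def iupac_reference(reference, snps):
    """snps maps 0-based position -> alternate allele."""
    seq = list(reference)
    for pos, alt in snps.items():
        pair = frozenset((seq[pos], alt))
        if len(pair) == 2:                 # skip alt identical to reference
            seq[pos] = IUPAC[pair]
    return "".join(seq)

print(iupac_reference("ACGTACGT", {1: "T", 4: "G"}))  # → AYGTRCGT
```

The mapper must of course score ambiguity codes as matching either base, which is why this strategy worked with NovoAlign in the benchmark above.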


2021 ◽  
Author(s):  
Parsoa Khorsand ◽  
Fereydoun Hormozdiari

Abstract Large-scale catalogs of common genetic variants (including indels and structural variants) are being created using data from second- and third-generation whole-genome sequencing technologies. However, genotyping these variants in newly sequenced samples is a nontrivial task that requires extensive computational resources. Furthermore, current approaches are mostly limited to specific types of variants and are generally prone to various errors and ambiguities when genotyping complex events. We propose an ultra-efficient approach for genotyping any type of structural variation that is not limited by the shortcomings and complexities of current mapping-based approaches. Our method, Nebula, utilizes changes in the counts of k-mers to predict the genotypes of structural variants. We show that Nebula is not only an order of magnitude faster than mapping-based approaches for genotyping structural variants but also comparable in accuracy to state-of-the-art approaches. Furthermore, Nebula is a generic framework not limited to any specific type of event. Nebula is publicly available at https://github.com/Parsoa/Nebula.
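The k-mer counting principle behind this kind of genotyping can be shown with a toy example: count occurrences of a k-mer that is unique to a variant's breakpoint, normalize by the expected sequencing depth, and pick the closest genotype fraction. The nearest-fraction rule and the read set below are illustrative simplifications, not Nebula's actual statistical model:

```python
# Toy illustration of mapping-free genotyping from k-mer counts: a k-mer
# unique to a structural variant should appear at ~0x, ~0.5x, or ~1x the
# expected depth for genotypes 0/0, 0/1, and 1/1 respectively. The
# nearest-fraction decision rule here is an illustrative stand-in.

def count_kmer(reads, kmer):
    return sum(read.count(kmer) for read in reads)

def genotype(reads, sv_kmer, expected_depth):
    ratio = count_kmer(reads, sv_kmer) / expected_depth
    # pick the nearest of 0 (0/0), 0.5 (0/1), 1.0 (1/1)
    return min(("0/0", 0.0), ("0/1", 0.5), ("1/1", 1.0),
               key=lambda g: abs(ratio - g[1]))[0]

reads = ["ACGTTTTT", "ACGTTTAA", "GGGGACGT", "CCCCCCCC"]
print(genotype(reads, "ACGT", expected_depth=3))  # → 1/1
```

Because only k-mer counting over the raw reads is needed, the expensive read-mapping step drops out entirely, which is where the order-of-magnitude speedup comes from.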


2020 ◽  
Vol 34 (07) ◽  
pp. 11693-11700 ◽  
Author(s):  
Ao Luo ◽  
Fan Yang ◽  
Xin Li ◽  
Dong Nie ◽  
Zhicheng Jiao ◽  
...  

Crowd counting is an important yet challenging task due to large variations in scale and density. Recent investigations have shown that distilling rich relations among multi-scale features and exploiting useful information from the auxiliary task, i.e., localization, are vital for this task. Nevertheless, how to comprehensively leverage these relations within a unified network architecture is still a challenging problem. In this paper, we present a novel network structure called Hybrid Graph Neural Network (HyGnn), which aims to address this problem by interweaving the multi-scale features for crowd density and its auxiliary task (localization) and performing joint reasoning over a graph. Specifically, HyGnn integrates a hybrid graph to jointly represent the task-specific feature maps of different scales as nodes, and two types of relations as edges: (i) multi-scale relations capturing the feature dependencies across scales and (ii) mutually beneficial relations building bridges for the cooperation between counting and localization. Thus, through message passing, HyGnn can capture and distill richer relations between nodes to obtain more powerful representations, providing robust and accurate results. HyGnn performs remarkably well on four challenging datasets: ShanghaiTech Part A, ShanghaiTech Part B, UCF_CC_50 and UCF_QNRF, outperforming state-of-the-art algorithms by a large margin.
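A single message-passing round, the mechanism HyGnn builds on, can be sketched in plain Python. Real GNN layers use learned weight matrices and nonlinearities, and HyGnn's nodes are feature maps rather than small vectors; this mean-aggregation toy only shows how information flows along edges:

```python
# Minimal sketch of one message-passing round over a graph: each node's new
# feature is the mean of its own feature and its neighbours' features.
# Learned transformations and nonlinearities are deliberately omitted.

def message_pass(features, edges):
    """features: node -> vector; edges: list of undirected (u, v) pairs."""
    neighbours = {n: [] for n in features}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    updated = {}
    for node, vec in features.items():
        msgs = [features[m] for m in neighbours[node]] + [vec]
        updated[node] = [sum(vals) / len(msgs) for vals in zip(*msgs)]
    return updated

features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
out = message_pass(features, [("a", "b"), ("b", "c")])
print(out["b"])
```

Stacking several such rounds lets information from distant nodes (here, features at other scales or from the other task) reach each node, which is what "capturing and distilling richer relations" amounts to mechanically.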


2019 ◽  
Vol 201 (17) ◽  
Author(s):  
Dragutin J. Savic ◽  
Scott V. Nguyen ◽  
Kimberly McCullor ◽  
W. Michael McShan

ABSTRACT A large-scale genomic inversion encompassing 0.79 Mb of the 1.816-Mb-long Streptococcus pyogenes serotype M49 strain NZ131 chromosome spontaneously occurs in a minor subpopulation of cells, and in this report genetic selection was used to obtain a stable lineage with this chromosomal rearrangement. This inversion, which drastically displaces the ori site relative to the terminus, changes the relative lengths of the replication arms so that one replichore is approximately 0.41 Mb while the other is about 1.40 Mb in length. Genomic reversion to the original chromosome constellation was not observed in PCR-monitored analyses after 180 generations of growth in rich medium. Compared to the parental strain, the invertant surprisingly demonstrates a nearly identical growth pattern early in the exponential phase, but differences do occur when resources in the medium become limited. When cultured separately in rich medium during prolonged stationary phase or in an experimental acute-infection animal model (Galleria mellonella), the parental strain and the invertant have equivalent survival rates. However, when they are coincubated, both in vitro and in vivo, the survival of the invertant declines relative to that of the parental strain. An accompanying aspect of the study suggests that inversions taking place near oriC always secure the linkage of oriC to the DNA sequences responsible for chromosome partition. The biological relevance of large-scale inversions is also discussed. IMPORTANCE Based on our previous work, we created what is to our knowledge the largest asymmetric inversion, covering 43.5% of the S. pyogenes genome. In spite of a drastic displacement of the origin of replication and the unbalanced sizes of the replichores (1.40 Mb versus 0.41 Mb), the invertant, when not challenged with its progenitor, showed impressive vitality for growth in vitro and in pathogenesis assays.
The mutant supports the existing idea that slightly deleterious mutations can provide the setting for secondary adaptive changes. Furthermore, comparative analysis of the mutant with previously published data strongly indicates that even large genomic rearrangements survive provided that the integrity of the oriC and the chromosome-partition cluster is preserved.
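The figures quoted above can be checked with trivial arithmetic (numbers taken directly from the abstract):

```python
# Quick consistency check of the quoted figures: the inverted segment as a
# fraction of the chromosome, and the two replichore lengths summing to
# roughly the chromosome size.
chromosome_mb = 1.816
inversion_mb = 0.79
replichores = (0.41, 1.40)
print(round(100 * inversion_mb / chromosome_mb, 1))  # → 43.5 (%)
print(round(sum(replichores), 2))                    # → 1.81 (Mb)
```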


2016 ◽  
Vol 82 (11) ◽  
pp. 3225-3238 ◽  
Author(s):  
Laura Glendinning ◽  
Steven Wright ◽  
Jolinda Pollock ◽  
Peter Tennant ◽  
David Collie ◽  
...  

ABSTRACT Sequencing technologies have recently facilitated the characterization of bacterial communities present in lungs during health and disease. However, there is currently a dearth of information concerning the variability of such data in health, both between and within subjects. This study seeks to examine such variability using healthy adult sheep as our model system. Protected specimen brush samples were collected from three spatially disparate segmental bronchi of six adult sheep (age, 20 months) on three occasions (day 0, 1 month, and 3 months). To further explore the spatial variability of the microbiotas, more-extensive brushing samples (n = 16) and a throat swab were taken from a separate sheep. The V2 and V3 hypervariable regions of the bacterial 16S rRNA genes were amplified and sequenced via Illumina MiSeq. DNA sequences were analyzed using the mothur software package. Quantitative PCR was performed to quantify total bacterial DNA. Some sheep lungs contained dramatically different bacterial communities at different sampling sites, whereas in others, airway microbiotas appeared similar across the lung. In our spatial variability study, we observed clustering related to the depth within the lung from which samples were taken. Lung depth refers to increasing distance from the glottis, progressing in a caudal direction. We conclude that both host influence and local factors have impacts on the composition of the sheep lung microbiota. IMPORTANCE Until recently, it was assumed that the lungs were a sterile environment which was colonized by microbes only during disease. However, recent studies using sequencing technologies have found that there is a small population of bacteria which exists in the lung during health, referred to as the "lung microbiota." In this study, we characterize the variability of the lung microbiotas of healthy sheep.
Sheep not only are economically important animals but also are often used as large animal models of human respiratory disease. We conclude that, while host influence does play a role in dictating the types of microbes which colonize the airways, it is clear that local factors also play an important role in this regard. Understanding the nature and influence of these factors will be key to understanding the variability in, and functional relevance of, the lung microbiota.

