Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements

GigaScience ◽  
2020 ◽  
Vol 9 (5) ◽  
Author(s):  
Morteza Hosseini ◽  
Diogo Pratas ◽  
Burkhard Morgenstern ◽  
Armando J Pinho

Abstract Background The development of high-throughput sequencing technologies and, as its result, the production of huge volumes of genomic data, has accelerated biological and medical research and discovery. Study on genomic rearrangements is crucial owing to their role in chromosomal evolution, genetic disorders, and cancer. Results We present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between 2 DNA sequences. This computational solution extracts information contents of the 2 sequences, exploiting a data compression technique to find rearrangements. We also present Smash++ visualizer, a tool that allows the visualization of the detected rearrangements along with their self- and relative complexity, by generating an SVG (Scalable Vector Graphics) image. Conclusions Tested on several synthetic and real DNA sequences from bacteria, fungi, Aves, and Mammalia, the proposed tool was able to accurately find genomic rearrangements. The detected regions were in accordance with previous studies, which took alignment-based approaches or performed FISH (fluorescence in situ hybridization) analysis. The maximum peak memory usage among all experiments was ∼1 GB, which makes Smash++ feasible to run on present-day standard computers.
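
The abstract above says only that a data compression technique is used to find rearrangements. As a rough illustration of the general idea, and not Smash++'s actual algorithm, the Python sketch below trains a simple order-k context model on one sequence and measures how many bits each base of the other sequence would cost under that model. The function names, the smoothing parameter `alpha`, and the choice of a single fixed-order model are all assumptions made for this example.

```python
import math
from collections import defaultdict


def train_context_model(reference, k):
    """Count which base follows each k-base context in the reference."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(reference) - k):
        counts[reference[i:i + k]][reference[i + k]] += 1
    return counts


def information_profile(target, counts, k, alpha=1.0):
    """Return the cost in bits of each target base under the reference model.

    Probabilities use additive smoothing (alpha) over the 4-letter DNA
    alphabet. Low values mark regions the reference predicts well, which
    suggests they resemble some region of the reference.
    """
    profile = []
    for i in range(k, len(target)):
        context_counts = counts.get(target[i - k:i], {})
        total = sum(context_counts.values())
        p = (context_counts.get(target[i], 0) + alpha) / (total + 4 * alpha)
        profile.append(-math.log2(p))
    return profile


reference = "ACGT" * 20
model = train_context_model(reference, k=3)
similar = information_profile("ACGT" * 5, model, k=3)    # shares content with reference
unrelated = information_profile("A" * 20, model, k=3)    # contexts never seen in reference
```

In this toy setting, stretches of the target with a low per-base cost are candidates for regions shared with the reference; a real tool would also have to handle inverted repeats, smoothing of the profile, and segmentation, none of which is attempted here.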


2020 ◽  
Author(s):  
Yang Young Lu ◽  
Jiaxing Bai ◽  
Yiwen Wang ◽  
Ying Wang ◽  
Fengzhu Sun

Abstract Motivation Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high-throughput sequencing data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10^2–10^4-fold reduction of storage space, CRAFT performs fast queries on gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability and implementation CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. Supplementary information Supplementary data are available at Bioinformatics online.
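
To make the memory concern concrete: a dense k-mer frequency vector over the DNA alphabet has 4^k entries, so its size grows exponentially with k. The Python sketch below builds such a vector and compares two sequences by cosine similarity. It illustrates plain k-mer frequency comparison only; it is not CRAFT's learned embedding, and the function names and the choice of cosine similarity are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Return the dense 4**k vector of relative k-mer frequencies of seq.

    k-mers containing characters other than A, C, G, T are counted in the
    total but have no slot in the vector.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    if total == 0:  # sequence shorter than k
        return [0.0] * (4 ** k)
    return [counts[kmer] / total for kmer in kmers]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vec_a = kmer_frequencies("ACGTACGT", 2)
vec_b = kmer_frequencies("AAAA", 2)
vec_c = kmer_frequencies("CCCC", 2)
```

With k = 2 the vector has 16 entries; at k = 12 it would already have about 16.8 million, which is the growth that motivates a compact representation.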


2019 ◽  
Author(s):  
Sen Liu ◽  
Yuping Wang ◽  
Wuning Tong ◽  
Shiwei Wei

Abstract Motivation The multiple longest common subsequence (MLCS) problem is to find all longest common subsequences of multiple character sequences. It appears in many fields, such as data mining, DNA alignment, bioinformatics, and text editing. As sequence length and the number of sequences increase, the existing dynamic programming algorithms and dominant point-based algorithms become ineffective and inefficient, especially for large-scale MLCS problems. Results In this paper, by considering that DNA sequences contain many consecutively repeated characters, we first design a character merging scheme that merges the consecutively repeated characters in the sequences. As a result, it shortens the sequences considered and saves the space needed to store them. To further reduce the space and time costs, we construct a weighted directed acyclic graph that is much smaller than the directed acyclic graph widely used for MLCS problems. Based on these techniques, we propose a fast and memory-efficient algorithm for MLCS problems. Finally, experiments are conducted and the proposed algorithm is compared with several state-of-the-art algorithms. The experimental results show that the proposed algorithm performs better than the compared state-of-the-art algorithms in both time and space costs. Availability and implementation https://www.ncbi.nlm.nih.gov/nuccore and https://github.com/liusen1006/MLCS.
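
Merging consecutively repeated characters can be pictured as run-length encoding. The Python sketch below is a minimal illustration under that assumption; the abstract does not spell out the authors' exact scheme or how the run lengths are used in their weighted graph, and the function names here are hypothetical.

```python
from itertools import groupby


def merge_runs(seq):
    """Collapse each run of identical characters into a (character, length) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(seq)]


def expand_runs(runs):
    """Inverse of merge_runs: rebuild the original sequence."""
    return "".join(ch * n for ch, n in runs)


runs = merge_runs("AAACCGTTTT")
restored = expand_runs(runs)
```

Here a 10-character sequence is represented by 4 pairs, and expanding the pairs restores the original, so no information is lost by the merge.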


2018 ◽  
Vol 115 (27) ◽  
pp. E6217-E6226 ◽  
Author(s):  
John A. Hawkins ◽  
Stephen K. Jones ◽  
Ilya J. Finkelstein ◽  
William H. Press

Many large-scale, high-throughput experiments use DNA barcodes, short DNA sequences prepended to DNA libraries, for identification of individuals in pooled biomolecule populations. However, DNA synthesis and sequencing errors confound the correct interpretation of observed barcodes and can lead to significant data loss or spurious results. Widely used error-correcting codes borrowed from computer science (e.g., Hamming, Levenshtein codes) do not properly account for insertions and deletions (indels) in DNA barcodes, even though deletions are the most common type of synthesis error. Here, we present and experimentally validate filled/truncated right end edit (FREE) barcodes, which correct substitution, insertion, and deletion errors, even when these errors alter the barcode length. FREE barcodes are designed with experimental considerations in mind, including balanced guanine-cytosine (GC) content, minimal homopolymer runs, and reduced internal hairpin propensity. We generate and include lists of barcodes with different lengths and error correction levels that may be useful in diverse high-throughput applications, including >10^6 single-error–correcting 16-mers that strike a balance between decoding accuracy, barcode length, and library size. Moreover, concatenating two or more FREE codes into a single barcode increases the available barcode space combinatorially, generating lists with >10^15 error-correcting barcodes. The included software for creating barcode libraries and decoding sequenced barcodes is efficient and designed to be user-friendly for the general biology community.
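
A small example shows why indels are troublesome for position-wise comparison. In the Python sketch below, one base is deleted from a barcode and the next downstream base slides into the fixed-length read window; the Hamming distance then reports a mismatch at nearly every position, while the edit distance is only 2. This demonstrates the underlying problem only. It does not implement the FREE code construction or its decoder, and the barcode, the downstream base, and the function names are made up for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


barcode = "ACGTACGT"
# Delete the 'C' at index 1; a hypothetical downstream base 'A' fills the window.
observed = barcode[:1] + barcode[2:] + "A"

position_mismatches = hamming(barcode, observed)
edit_operations = levenshtein(barcode, observed)
```

A single synthesis error therefore looks like 7 substitutions to a Hamming-based decoder, even though only one deletion (plus the filled-in end base) occurred.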


2018 ◽  
Author(s):  
Benjamin T. James ◽  
Brian B. Luczak ◽  
Hani Z. Girgis

Abstract Motivation Pairwise alignment is a predominant algorithm in the field of bioinformatics. This algorithm is quadratic and therefore slow, especially on long sequences. Many applications utilize identity scores without the corresponding alignments. For these applications, we propose FASTCAR. It produces identity scores for pairs of DNA sequences using alignment-free methods and two self-supervised general linear models. Results For the first time, the new tool can predict the pair-wise identity score in linear time and space. On two large-scale sequence databases, FASTCAR provided the best compromise between sensitivity and precision while being faster than BLAST by 40% and faster than USEARCH by 6–10 times. Further, FASTCAR is capable of producing the pair-wise identity scores of long DNA sequences, namely bacterial genomes that are millions of nucleotides long; this task cannot be accomplished by any alignment-based tool. Availability FASTCAR is available at https://github.com/TulsaBioinformaticsToolsmith/FASTCAR and as the Supplementary Dataset. Supplementary information Supplementary data are available online.


BMC Biology ◽  
2019 ◽  
Vol 17 (1) ◽  
Author(s):  
Amrita Srivathsan ◽  
Emily Hartop ◽  
Jayanthi Puniamoorthy ◽  
Wan Ting Lee ◽  
Sujatha Narayanan Kutty ◽  
...  

Abstract Background More than 80% of all animal species remain unknown to science. Most of these species live in the tropics and belong to animal taxa that combine small body size with high specimen abundance and large species richness. For such clades, using morphology for species discovery is slow because large numbers of specimens must be sorted based on detailed microscopic investigations. Fortunately, species discovery could be greatly accelerated if DNA sequences could be used for sorting specimens to species. Morphological verification of such “molecular operational taxonomic units” (mOTUs) could then be based on dissection of a small subset of specimens. However, this approach requires cost-effective and low-tech DNA barcoding techniques because well-equipped, well-funded molecular laboratories are not readily available in many biodiverse countries. Results We here document how MinION sequencing can be used for large-scale species discovery in a specimen- and species-rich taxon like the hyperdiverse fly family Phoridae (Diptera). We sequenced 7059 specimens collected in a single Malaise trap in Kibale National Park, Uganda, over the short period of 8 weeks. We discovered > 650 species which exceeds the number of phorid species currently described for the entire Afrotropical region. The barcodes were obtained using an improved low-cost MinION pipeline that increased the barcoding capacity sevenfold from 500 to 3500 barcodes per flowcell. This was achieved by adopting 1D sequencing, resequencing weak amplicons on a used flowcell, and improving demultiplexing. Comparison with Illumina data revealed that the MinION barcodes were very accurate (99.99% accuracy, 0.46% Ns) and thus yielded very similar species units (match ratio 0.991). Morphological examination of 100 mOTUs also confirmed good congruence with morphology (93% of mOTUs; > 99% of specimens) and revealed that 90% of the putative species belong to the neglected, megadiverse genus Megaselia. 
We demonstrate for one Megaselia species how the molecular data can guide the description of a new species (Megaselia sepsioides sp. nov.). Conclusions We document that one field site in Africa can be home to an estimated 1000 species of phorids and speculate that the Afrotropical diversity could exceed 200,000 species. We furthermore conclude that low-cost MinION sequencers are very suitable for reliable, rapid, and large-scale species discovery in hyperdiverse taxa. MinION sequencing could quickly reveal the extent of the unknown diversity and is especially suitable for biodiverse countries with limited access to capital-intensive sequencing facilities.


Diversity ◽  
2019 ◽  
Vol 11 (12) ◽  
pp. 234 ◽  
Author(s):  
Eric A. Griffin ◽  
Joshua G. Harrison ◽  
Melissa K. McCormick ◽  
Karin T. Burghardt ◽  
John D. Parker

Although decades of research have typically demonstrated a positive correlation between biodiversity of primary producers and associated trophic levels, the ecological drivers of this association are poorly understood. Recent evidence suggests that the plant microbiome, or the fungi and bacteria found on and inside plant hosts, may be cryptic yet important drivers of important processes, including primary production and trophic interactions. Here, using high-throughput sequencing, we characterized foliar fungal community diversity, composition, and function from 15 broadleaved tree species (N = 545) in a recently established, large-scale temperate tree diversity experiment using over 17,000 seedlings. Specifically, we tested whether increases in tree richness and phylogenetic diversity would increase fungal endophyte diversity (the “Diversity Begets Diversity” hypothesis), as well as alter community composition (the “Tree Diversity–Endophyte Community” hypothesis) and function (the “Tree Diversity–Endophyte Function” hypothesis) at different spatial scales. We demonstrated that increasing tree richness and phylogenetic diversity decreased fungal species and functional guild richness and diversity, including pathogens, saprotrophs, and parasites, within the first three years of a forest diversity experiment. These patterns were consistent at the neighborhood and tree plot scale. Our results suggest that fungal endophytes, unlike other trophic levels (e.g., herbivores as well as epiphytic bacteria), respond negatively to increasing plant diversity.


2019 ◽  
Vol 201 (17) ◽  
Author(s):  
Dragutin J. Savic ◽  
Scott V. Nguyen ◽  
Kimberly McCullor ◽  
W. Michael McShan

ABSTRACT A large-scale genomic inversion encompassing 0.79 Mb of the 1.816-Mb-long Streptococcus pyogenes serotype M49 strain NZ131 chromosome spontaneously occurs in a minor subpopulation of cells, and in this report genetic selection was used to obtain a stable lineage with this chromosomal rearrangement. This inversion, which drastically displaces the ori site relative to the terminus, changes the relative length of the replication arms so that one replichore is approximately 0.41 Mb while the other is about 1.40 Mb in length. Genomic reversion to the original chromosome constellation is not observed in PCR-monitored analyses after 180 generations of growth in rich medium. Compared to the parental strain, the inversion surprisingly demonstrates a nearly identical growth pattern in the first phase of the exponential phase, but differences do occur when resources in the medium become limited. When cultured separately in rich medium during prolonged stationary phase or in an experimental acute infection animal model (Galleria mellonella), the parental strain and the invertant have equivalent survival rates. However, when they are coincubated together, both in vitro and in vivo, the survival of the invertant declines relative to the level for the parental strain. The accompanying aspect of the study suggests that inversions taking place near oriC always happen to secure the linkage of oriC to DNA sequences responsible for chromosome partition. The biological relevance of large-scale inversions is also discussed. IMPORTANCE Based on our previous work, we created to our knowledge the largest asymmetric inversion, covering 43.5% of the S. pyogenes genome. In spite of a drastic replacement of origin of replication and the unbalanced size of replichores (1.4 Mb versus 0.41 Mb), the invertant, when not challenged with its progenitor, showed impressive vitality for growth in vitro and in pathogenesis assays.
The mutant supports the existing idea that slightly deleterious mutations can provide the setting for secondary adaptive changes. Furthermore, comparative analysis of the mutant with previously published data strongly indicates that even large genomic rearrangements survive provided that the integrity of the oriC and the chromosome partition cluster is preserved.


2021 ◽  
pp. 1-9
Author(s):  
Paulo E.A.S. Câmara ◽  
Láuren M.D. De Souza ◽  
Otávio Henrique Bezerra Pinto ◽  
Peter Convey ◽  
Eduardo T. Amorim ◽  
...  

Abstract Antarctic lakes have generally simple periphyton communities when compared with those of lower latitudes. To date, assessment of microbial diversity in Antarctica has relied heavily on traditional direct observation and cultivation methods. In this study, sterilized cotton baits were left submerged for two years in two lakes on King George Island and Deception Island, South Shetland Islands (Maritime Antarctic), followed by assessment of diversity by metabarcoding using high-throughput sequencing. DNA sequences of 44 taxa belonging to four kingdoms and seven phyla were found. Thirty-six taxa were detected in Hennequin Lake on King George Island and 20 taxa were detected in Soto Lake on Deception Island. However, no significant difference in species composition was detected between the two assemblages (Shannon index). Our data suggest that metabarcoding provides a suitable method for the assessment of periphyton biodiversity in oligotrophic Antarctic lakes.

