Pattern Matching for DNA Sequencing Data Using Multiple Bloom Filters

2019 ◽  
Vol 2019 ◽  
pp. 1-9 ◽  
Author(s):  
Maleeha Najam ◽  
Raihan Ur Rasool ◽  
Hafiz Farooq Ahmad ◽  
Usman Ashraf ◽  
Asad Waqar Malik

Storing and processing large DNA sequences has always been a major problem because of the increasing volume of DNA sequence data. A number of solutions have been proposed, but they require significant computation and memory. An efficient storage and pattern matching solution is therefore required for DNA sequencing data. Bloom filters (BFs) are an efficient data structure that is mostly used in bioinformatics for the classification of DNA sequences. In this paper, we explore uses of BFs beyond classification. The proposed solution is based on Multiple Bloom Filters (MBFs) and finds all the locations and the number of repetitions of a specified pattern inside a DNA sequence. Both of these factors are extremely important in determining the type and intensity of any disease. This paper serves as a first effort towards optimizing the search for the location and frequency of substrings in DNA sequences using MBFs. It presents a proof-of-concept implementation for a given set of data using the proposed MBF technique, and we expect that further optimizations of the proposed solution can bring remarkable results. Performance evaluation shows improved accuracy and time efficiency of the proposed approach.
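The idea of locating and counting a pattern with several Bloom filters can be illustrated with a minimal sketch. It is not the authors' implementation: the block layout, the names `BloomFilter`, `build_filters`, and `find_pattern`, and the filter sizes are all assumptions made for illustration. The sequence is split into blocks of start positions, one filter per block stores that block's k-mers, and a query only scans the blocks whose filter may contain the pattern.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; the default sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_filters(sequence, k, block_size):
    """One filter per block of start positions (hypothetical layout)."""
    last_start = len(sequence) - k + 1
    filters = []
    for start in range(0, last_start, block_size):
        bf = BloomFilter()
        for i in range(start, min(start + block_size, last_start)):
            bf.add(sequence[i:i + k])
        filters.append(bf)
    return filters


def find_pattern(sequence, pattern, filters, block_size):
    """Return every start position of pattern; len(pattern) must equal k."""
    k = len(pattern)
    last_start = len(sequence) - k + 1
    hits = []
    for block, bf in enumerate(filters):
        if pattern not in bf:
            continue  # a Bloom filter has no false negatives, so skip safely
        start = block * block_size
        for i in range(start, min(start + block_size, last_start)):
            # Direct comparison removes any Bloom filter false positive.
            if sequence[i:i + k] == pattern:
                hits.append(i)
    return hits


if __name__ == "__main__":
    seq = "ACGTACGTTTACGT"
    filters = build_filters(seq, k=4, block_size=5)
    positions = find_pattern(seq, "ACGT", filters, block_size=5)
    print(positions, len(positions))
```

Because every candidate block is verified by direct comparison, the reported locations and the repetition count are exact; the filters only decide which blocks can be skipped.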

Genetics ◽  
1993 ◽  
Vol 134 (4) ◽  
pp. 1195-1204
Author(s):  
S Tarès ◽  
J M Cornuet ◽  
P Abad

An AluI family of highly reiterated nontranscribed sequences has been found in the genome of the honeybee Apis mellifera. This repeated sequence is shown to be present at approximately 23,000 copies per haploid genome, constituting about 2% of the total genomic DNA. The nucleotide sequence of 10 monomers was determined. The consensus sequence is 176 nucleotides long and has an A + T content of 58%. There are clusters of both direct and inverted repeats. Internal subrepeating units ranging from 11 to 17 nucleotides are observed, suggesting that it could have evolved from a shorter sequence. DNA sequence data reveal that this repeat class is unusually homogeneous compared to the other class of invertebrate highly reiterated DNA sequences. The average pairwise sequence divergence between the repeats is 2.5%. In spite of this unusual homogeneity, divergence has been found in the repeated sequence hybridization ladder between four different honeybee subspecies. Therefore, the AluI highly reiterated sequences provide a new probe for fingerprinting in A. m. mellifera.


2018 ◽  
Vol 29 (08) ◽  
pp. 1249-1255
Author(s):  
Kamil Salikhov

Modern DNA sequencing technologies generate prodigious volumes of sequence data consisting of short DNA fragments (reads). Storing and transferring this data is often challenging. With this motivation, several specialized compression methods have been developed. In this paper, we present an improvement of the lossless reference-free compression algorithm, suggested by Rozov et al., based on the technique of cascading Bloom filters. Through computational experiments on real data, we demonstrate that our method results in a significant associated memory reduction in practice.
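A Bloom filter cascade can be sketched as follows. This shows one common formulation of the general cascade idea, not the compression algorithm of the paper or of Rozov et al.; it assumes that the full set of items that will ever be queried is known when the cascade is built, and the names `build_cascade` and `query` and the filter sizes are hypothetical. Each level stores the false positives of the previous level, and a small explicit set resolves whatever survives the last filter.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, num_bits, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_cascade(members, non_members, levels=3, num_bits=32):
    """Build `levels` filters plus an explicit set of leftover items."""
    filters = []
    stored, other = set(members), set(non_members)
    for _ in range(levels):
        bf = BloomFilter(num_bits)
        for item in stored:
            bf.add(item)
        filters.append(bf)
        # Items of the opposite set that wrongly pass this filter
        # become the content of the next level.
        stored, other = {x for x in other if x in bf}, stored
    return filters, stored  # `stored` is kept explicitly, not in a filter


def query(item, filters, explicit):
    """Exact membership for any item from members or non_members."""
    for level, bf in enumerate(filters):
        if item not in bf:
            # Failing an odd-indexed filter means the item is a true member.
            return level % 2 == 1
    # Survivors are resolved by the explicit set, which holds members
    # when the number of filters is even and non-members when it is odd.
    return (item in explicit) == (len(filters) % 2 == 0)
```

The answers are exact only for items that were present at build time; an item outside both input sets can still be misclassified.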


Zootaxa ◽  
2012 ◽  
Vol 3361 (1) ◽  
pp. 56-62 ◽  
Author(s):  
JOSEFINA CURIEL ◽  
JUAN J. MORRONE

Insect life stages are known imperfectly in many cases, and classifications are usually based on adult morphology. This is unfortunate as information on other life stages may be useful for biomonitoring. The major impediment to using elmid (Coleoptera) larvae for freshwater biomonitoring is the lack of larval descriptions and illustrations. Reliable molecular protocols may be used to associate larvae and adults. After adults of seven species of Mexican Macrelmis were identified morphologically, seven larval specimens were associated to them based on two gene fragments: Cox1 and Cob. The phylogenetic analysis allowed identifying the larval specimens as Macrelmis leonilae, M. scutellaris, M. species 7, M. species 10, and M. species 11. Two species based on adults associated uncertainly with one larva, and one larva did not match with any adult. Adult/larval association in elmids using DNA sequence data seems to be promising in terms of speed and reliability.


2000 ◽  
Vol 38 (12) ◽  
pp. 4430-4438 ◽  
Author(s):  
Raphael P. Viscidi ◽  
James C. Demma ◽  
Jing Gu ◽  
Jonathan Zenilman

Typing of gonococcal strains is a valuable tool for the biological confirmation of sexual contacts. We have developed a typing method based on DNA sequencing of two overlapping por gene fragments generated by a heminested PCR. We compared sequencing of the por gene (POR sequencing) and typing of the opa gene (OPA typing) for the characterization of strains from 17 sexual partnerships. Both methods were highly discriminatory. A different genotype was detected in 15 of the 17 epidemiologically unconnected couples by POR sequencing and in 16 of the 17 couples by OPA typing with restriction enzyme HpaII. Within partnerships, identical genotypes were obtained from 16 of the 17 known sex contacts by POR sequencing and from 15 of the 17 by OPA typing. Compared to OPA typing, which relies on interpretation of bands in a gel, DNA sequence data offer the advantage of being objective and portable. As costs for sequencing decline, the method should become affordable for most laboratory personnel who wish to type gonococcal strains.


2021 ◽  
Vol 12 ◽  
Author(s):  
Ying Zhang ◽  
Yupei Zhou ◽  
Wei Sun ◽  
Lili Zhao ◽  
D. Pavlic-Zupanc ◽  
...  

The genus Botryosphaeria includes more than 200 epithets, but only the type species, Botryosphaeria dothidea, and a dozen or more other species have been identified based on DNA sequence data. The taxonomic status of the other species remains unconfirmed because they lack either morphological information or DNA sequence data. In this study, types or authentic specimens of 16 “Botryosphaeria” species are reassessed to clarify their identity and phylogenetic position. nuDNA sequences of four regions, ITS, LSU, tef1-α and tub2, are analyzed and considered in combination with morphological characteristics. Based on the multigene phylogeny and morphological characters, Botryosphaeria cruenta and Botryosphaeria hamamelidis are transferred to Neofusicoccum. The generic status of Botryosphaeria aterrima and Botryosphaeria mirabile is confirmed in Botryosphaeria. Botryosphaeria berengeriana var. weigeliae and B. berengeriana var. acerina are treated as synonyms of B. dothidea. Botryosphaeria mucosa is transferred to Neodeightonia as Neodeightonia mucosa, and Botryosphaeria ferruginea to Nothophoma as Nothophoma ferruginea. Botryosphaeria foliicola is reduced to synonymy with Phyllachorella micheliae. Botryosphaeria abuensis, Botryosphaeria aesculi, Botryosphaeria dasylirii, and Botryosphaeria wisteriae are tentatively kept in Botryosphaeria sensu stricto until further phylogenetic analysis is carried out on verified specimens. The ordinal status of Botryosphaeria apocyni, Botryosphaeria gaubae, and Botryosphaeria smilacinina cannot be determined, and these species are tentatively accommodated in Dothideomycetes incertae sedis. The study demonstrates the significance of a polyphasic approach in characterizing type specimens, including the importance of using DNA sequence data.


2016 ◽  
Vol 36 (1) ◽  
Author(s):  
Paola Berchialla

We introduce a Bayesian hierarchical model for mitochondrial DNA sequence data, which is fitted via acceptance-rejection algorithms. The model incorporates parametric models of population history explicitly as well as a mutational process allowing for a simultaneous parameter estimation whose importance has become increasingly clear in many recent studies. The model is applied to a sample of DNA sequences from the Italian population.


2021 ◽  
Vol 22 (3) ◽  
pp. 505
Author(s):  
SONIA GIULIETTI ◽  
TIZIANA ROMAGNOLI ◽  
ALESSANDRA CAMPANELLI ◽  
CECILIA TOTTI ◽  
STEFANO ACCORONI

The ecology and seasonality of Pseudo-nitzschia species and their contribution to the phytoplankton community were analysed for the first time at the coastal station of the LTER-Senigallia-Susak transect (north-western Adriatic Sea) from 1988 to 2020. Species composition was addressed using DNA sequence data obtained from 106 monoclonal strains isolated from January 2018 to January 2020. The mean annual cycle of total phytoplankton in the study period (Feb 1988–Jan 2020) showed maximum abundances in winter followed by other peaks in spring and autumn. Diatoms were the main contributors in terms of abundance during the winter and the spring blooms. The autumn peak was due to phytoflagellates and diatoms. In summer phytoflagellates dominated the community, followed by diatoms and dinoflagellates, which in this season reached their annual maximum. Pseudo-nitzschia spp. represented on average 0.4–17.6% of the diatom community, but during their blooms they could reach up to 90% of the total diatom abundances with 10⁶ cells l⁻¹. By LM, six different taxa were recognized: Pseudo-nitzschia cf. delicatissima and P. cf. pseudodelicatissima were the most abundant, followed by P. cf. fraudulenta, P. pungens, P. multistriata and P. cf. galaxiae. P. cf. fraudulenta and P. pungens were indicator taxa of winter. P. cf. delicatissima and P. cf. pseudodelicatissima were spring and summer taxa, respectively. P. galaxiae showed maximum abundances in autumn. DNA sequences revealed the presence of two species belonging to the ‘P. seriata group’ (i.e. P. fraudulenta and P. pungens) and four species belonging to the ‘P. delicatissima group’ (P. calliantha and P. mannii within the P. pseudodelicatissima species complex, and P. delicatissima and P. cf. arenysensis within the P. delicatissima species complex). The presence of several cryptic and pseudo-cryptic species highlights the need to combine LM observations with DNA sequence data when the ecology of Pseudo-nitzschia is investigated.


2017 ◽  
Author(s):  
Erik Garrison ◽  
Jouni Sirén ◽  
Adam M. Novak ◽  
Glenn Hickey ◽  
Jordan M. Eizenga ◽  
...  

Reference genomes guide our interpretation of DNA sequence data. However, conventional linear references are fundamentally limited in that they represent only one version of each locus, whereas the population may contain multiple variants. When the reference represents an individual’s genome poorly, it can impact read mapping and introduce bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation, including large scale structural variation such as inversions and duplications [1]. Equivalent structures are produced by de novo genome assemblers [2,3]. Here we present vg, a toolkit of computational methods for creating, manipulating, and utilizing these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays [4], with improved accuracy over alignment to a linear reference, creating data structures to support downstream variant calling and genotyping. These capabilities make using variation graphs as reference structures for DNA sequencing practical at the scale of vertebrate genomes, or at the topological complexity of new species assemblies.
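To make the notion of a bidirected sequence graph concrete, here is a minimal sketch. It does not use or imitate the vg API: the class `VariationGraph`, its methods, and the example node contents are hypothetical. Nodes carry DNA fragments, each path step names a node and an orientation, and a node visited in reverse contributes its reverse complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


class VariationGraph:
    """Toy bidirected sequence graph (illustrative, not the vg data model)."""

    def __init__(self):
        self.nodes = {}     # node id -> forward-strand sequence
        self.edges = set()  # ((src, src_reverse), (dst, dst_reverse))

    def add_node(self, node_id, sequence):
        self.nodes[node_id] = sequence

    def add_edge(self, src, dst, src_reverse=False, dst_reverse=False):
        self.edges.add(((src, src_reverse), (dst, dst_reverse)))
        # The same edge can be walked on the opposite strand.
        self.edges.add(((dst, not dst_reverse), (src, not src_reverse)))

    def spell(self, path):
        """Spell the sequence of a path given as (node_id, is_reverse) steps."""
        for step, nxt in zip(path, path[1:]):
            if (step, nxt) not in self.edges:
                raise ValueError(f"no edge between {step} and {nxt}")
        return "".join(
            reverse_complement(self.nodes[node]) if rev else self.nodes[node]
            for node, rev in path
        )


if __name__ == "__main__":
    # A single-base variant: node 2 and node 3 are alternative alleles.
    graph = VariationGraph()
    for node_id, seq in [(1, "ACG"), (2, "T"), (3, "A"), (4, "GGA")]:
        graph.add_node(node_id, seq)
    for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4)]:
        graph.add_edge(src, dst)

    reference = [(1, False), (2, False), (4, False)]
    alternate = [(1, False), (3, False), (4, False)]
    print(graph.spell(reference), graph.spell(alternate))
```

Storing each edge together with its opposite-strand twin keeps the validity check in `spell` to a single set lookup, at the cost of doubling the edge set.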


2018 ◽  
Author(s):  
Michael Gruenstaeudl ◽  
Yannick Hartmaring

Background: The submission of DNA sequences to public sequence databases is an essential, but insufficiently automated step in the process of generating and disseminating novel DNA sequence data. Despite the centrality of database submissions to biological research, the range of available software tools that facilitate the preparation of sequence data for database submissions is limited, especially for sequences generated via plant DNA barcoding. Current submission procedures can be complex and prohibitively time-consuming for any but a small number of input sequences. A user-friendly software tool is needed that streamlines the file preparation for database submissions of DNA sequences that are commonly generated in plant DNA barcoding.
Methods: A Python package was developed that converts DNA sequences from the common EMBL and GenBank flat file formats to submission-ready, tab-delimited spreadsheets (so-called “checklists”) for a subsequent upload to the public sequence database of the European Nucleotide Archive (ENA). The software tool, titled “EMBL2checklists”, automatically converts DNA sequences, their annotation features, and associated metadata into the idiosyncratic format of marker-specific ENA checklists and, thus, generates output that can be uploaded via the interactive Webin submission system of ENA.
Results: EMBL2checklists provides a simple, platform-independent tool that automates the conversion of common plant DNA barcoding sequences into easily editable spreadsheets that require no further processing but their upload to ENA via the interactive Webin submission system. The software is equipped with an intuitive graphical as well as an efficient command-line interface for its operation. The utility of the software is illustrated by its application in the submission of DNA sequences of two recent plant phylogenetic investigations and one fungal metagenomic study.
Discussion: EMBL2checklists bridges the gap between common software suites for DNA sequence assembly and annotation and the interactive data submission process of ENA. It represents an easy-to-use solution for plant biologists without bioinformatics expertise to generate submission-ready checklists from common plant DNA sequence data. It allows the post-processing of checklists as well as work-sharing during the submission process and solves a critical bottleneck in the effort to increase participation in public data sharing.

