iBLAST: Incremental BLAST of new sequences via automated e-value correction

PLoS ONE ◽  
2021 ◽  
Vol 16 (4) ◽  
pp. e0249410
Author(s):  
Sajal Dash ◽  
Sarthok Rasique Rahman ◽  
Heather M. Hines ◽  
Wu-chun Feng

Search results from local alignment search tools are reported with statistical scores that are sensitive to the size of the database. For example, NCBI BLAST reports the best matches using similarity scores and expect values (i.e., e-values) calculated against the database size. Given the astronomical growth in genomics data, sequence databases grow continuously as new sequences are added. As a consequence, the results (e.g., best hits) and associated statistics (e.g., e-values) for a given set of queries may change over the course of a genomic investigation. Thus, to update the results of a previously conducted BLAST search against an updated database, scientists must currently rerun the entire search, which translates into irrecoverable and, in turn, wasted execution time, money, and computational resources. To address this issue, we devise a novel and efficient method to redeem past BLAST searches, called iBLAST. iBLAST leverages previous BLAST search results to run the same query search on only the incremental (i.e., newly added) part of the database, recomputes the associated critical statistics such as e-values, and combines these results to produce updated search results. Our experimental results and fidelity analyses show that iBLAST delivers search results identical to NCBI BLAST at a substantially reduced computational cost: iBLAST runs (1 + δ)/δ times faster than NCBI BLAST, where δ is the fraction by which the database has grown. We then present three use cases that demonstrate how iBLAST enables efficient biological discovery at a fraction of the computational cost.
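A minimal sketch of the e-value bookkeeping involved (illustrative function names, not iBLAST's actual API): in Karlin-Altschul statistics an alignment's expect value is E = K·m·n·e^(−λS), so for a fixed query and score it scales linearly with the (effective) database size n. A hit from the old search can therefore be rescaled to the grown database and re-ranked together with hits from the incremental search.

```python
# Sketch only: rescale old hits to the new database size and merge with
# hits from the incremental search. Names are illustrative, not iBLAST's API.

def rescale_evalue(old_evalue: float, old_db_size: int, new_db_size: int) -> float:
    """First-order e-value correction: E scales linearly with database size."""
    return old_evalue * (new_db_size / old_db_size)

def merge_hits(old_hits, new_hits, old_db_size, new_db_size, keep=10):
    """old_hits, new_hits: lists of (hit_id, evalue). Assumes new_hits'
    e-values were already corrected to the full new database size (hits
    from a search of only the increment would need the same rescaling)."""
    rescaled = [(hid, rescale_evalue(e, old_db_size, new_db_size))
                for hid, e in old_hits]
    pooled = rescaled + list(new_hits)
    pooled.sort(key=lambda pair: pair[1])   # smaller e-value = better hit
    return pooled[:keep]
```

NCBI BLAST's exact computation uses effective (edge-corrected) lengths, so a faithful tool recomputes the statistics rather than merely rescaling; the sketch shows only the first-order idea.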

2018 ◽  
Author(s):  
Sajal Dash ◽  
Sarthok Rahman ◽  
Heather M. Hines ◽  
Wu-chun Feng

Abstract
Motivation: Search results from local alignment search tools use statistical parameters that are sensitive to the size of the database. NCBI BLAST, for example, reports important matches using similarity scores and expect values (e-values) calculated against the database size. Over the course of an investigation, the database grows and the best matches may change. To update the results of a sequence similarity search and find the best hits, bioinformaticians must rerun the BLAST search against the entire database; this translates into irredeemably spent time, money, and computational resources.
Results: We develop an efficient way to redeem spent BLAST search effort by introducing Incremental BLAST. This tool makes use of previous BLAST search results as it conducts new searches on only the incremental part of the database, recomputes statistical metrics such as e-values, and combines the two sets of results to produce updated results. We develop statistics for correcting the e-values of any BLAST result against any arbitrary sequence database. The experimental results and accuracy analysis demonstrate that Incremental BLAST provides search results identical to NCBI BLAST at a significantly reduced computational cost. We apply three case studies to showcase settings where Incremental BLAST makes biological discovery more efficient at a reduced cost. The tool can be used to update sequence BLASTs during the course of genomic and transcriptomic projects, such as re-annotation projects, and to incrementally add taxon-specific sequences to a BLAST database. Incremental BLAST performs (1 + δ)/δ times faster than NCBI BLAST for a δ fraction of database growth.
Availability: Incremental BLAST is available at https://bitbucket.org/sajal000/incremental-blast
Contact: [email protected]
Supplementary information: Supplementary data are available at https://bitbucket.org/sajal000/incremental-blast
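The quoted speedup follows from a simple cost model (a sketch, assuming search time scales linearly with database size): normalize the old database to size 1 and let the new portion have size δ, so a full rerun searches 1 + δ while the incremental search touches only δ:

$$ \text{speedup} = \frac{t_{\text{full}}}{t_{\text{incremental}}} = \frac{1+\delta}{\delta} $$

For example, a database that has grown by 10% (δ = 0.1) gives an 11× speedup, while a database that has doubled (δ = 1) gives only 2×.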


Author(s):  
Yudong Qiu ◽  
Daniel Smith ◽  
Chaya Stern ◽  
mudong feng ◽  
Lee-Ping Wang

The parameterization of torsional/dihedral angle potential energy terms is a crucial part of developing molecular mechanics force fields. Quantum mechanical (QM) methods are often used to provide samples of the potential energy surface (PES) for fitting the empirical parameters in these force field terms. To ensure that the sampled molecular configurations are thermodynamically feasible, constrained QM geometry optimizations are typically carried out, which relax the orthogonal degrees of freedom while fixing the target torsion angle(s) on a grid of values. However, the quality of results and computational cost are affected by various factors on a non-trivial PES, such as dependence on the chosen scan direction and the lack of efficient approaches to integrate results started from multiple initial guesses. In this paper we propose a systematic and versatile workflow called TorsionDrive to generate energy-minimized structures on a grid of torsion constraints by means of a recursive wavefront propagation algorithm, which resolves the deficiencies of conventional scanning approaches and generates higher quality QM data for force field development. The capabilities of our method are presented for multi-dimensional scans and multiple initial guess structures, and an integration with the MolSSI QCArchive distributed computing ecosystem is described. The method is implemented in an open-source software package that is compatible with many QM software packages and energy minimization codes.
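A minimal sketch of the wavefront idea (not TorsionDrive's actual implementation; `minimize` stands in for a constrained QM geometry optimization that fixes the torsion at a grid angle and returns the energy and relaxed structure): each newly found minimum is pushed to the neighboring grid points as a fresh starting guess, and a grid point is updated only when a strictly lower-energy structure is found, so the scan terminates once no point improves.

```python
from collections import deque

def wavefront_scan(seeds, grid, minimize, tol=1e-8):
    """seeds: list of (grid_angle, structure) initial guesses.
    grid: sorted list of torsion angles covering the periodic scan range.
    minimize(structure, angle) -> (energy, relaxed_structure)."""
    best = {}                                   # angle -> (energy, structure)
    queue = deque(seeds)
    while queue:
        angle, guess = queue.popleft()
        energy, relaxed = minimize(guess, angle)
        if angle not in best or energy < best[angle][0] - tol:
            best[angle] = (energy, relaxed)     # improved: propagate outward
            i = grid.index(angle)
            for j in (i - 1, i + 1):            # wrap around the periodic grid
                queue.append((grid[j % len(grid)], relaxed))
    return best
```

Multi-dimensional scans generalize `grid` and the neighbor step to a product grid, and multiple initial guesses simply seed additional wavefronts, which is how results from several starting structures are integrated.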


2019 ◽  
Vol 14 (2) ◽  
pp. 157-163
Author(s):  
Majid Hajibaba ◽  
Mohsen Sharifi ◽  
Saeid Gorgin

Background: One of the pivotal challenges in today's genomic research is the fast processing of voluminous data, such as that generated by high-throughput Next-Generation Sequencing technologies. On the other hand, BLAST (Basic Local Alignment Search Tool), a long-established and renowned tool in bioinformatics, has proven to be incredibly slow in this regard.
Objective: To improve the performance of BLAST in the processing of voluminous data, we applied a novel memory-aware technique to BLAST for faster parallel processing.
Method: We used a master-worker model alongside a memory-aware technique in which the master partitions the whole data into equal chunks, one chunk for each worker; each worker then further splits and formats its allocated chunk according to the size of its memory. Each worker searches its splits one by one through a list of queries.
Results: We chose a list of queries with different lengths to run intensive searches against a huge database, UniProtKB/TrEMBL. Our experiments show a 20 percent improvement in performance when workers used our proposed memory-aware technique compared to when they were not memory-aware. Experiments show an even higher performance improvement, approximately 50 percent, when we applied our memory-aware technique to mpiBLAST.
Conclusion: We have shown that memory-awareness in formatting a bulky database when running BLAST can improve performance significantly while preventing unexpected crashes in low-memory environments. Even though distributed computing attempts to mitigate search time by partitioning and distributing database portions, our memory-aware technique alleviates the negative effects of page faults on performance.
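A minimal sketch of the partitioning scheme as described (illustrative names, not the paper's code): the master deals out equal chunks, and each worker re-splits its chunk so every piece fits in its own memory, so formatting and searching never spill into swap and trigger page faults.

```python
def master_partition(db_size: int, n_workers: int):
    """Master: split the database into equal chunks, one per worker.
    Returns (offset, size) pairs in bytes; the last chunk absorbs the remainder."""
    base = db_size // n_workers
    offsets = [w * base for w in range(n_workers)]
    sizes = [base] * (n_workers - 1) + [db_size - base * (n_workers - 1)]
    return list(zip(offsets, sizes))

def worker_splits(offset: int, size: int, mem_budget: int):
    """Worker: re-split its chunk so each piece fits in local memory."""
    pieces, pos, end = [], offset, offset + size
    while pos < end:
        step = min(mem_budget, end - pos)
        pieces.append((pos, step))
        pos += step
    return pieces

# Each worker then formats and searches its pieces sequentially, e.g.:
# for off, sz in worker_splits(offset, size, mem_budget):
#     format_db(off, sz)          # hypothetical helpers standing in for
#     run_blast(queries, off, sz) # makeblastdb / BLAST invocations
```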


2021 ◽  
Vol 5 (2) ◽  
Author(s):  
Hannah C Cai ◽  
Leanne E King ◽  
Johanna T Dwyer

ABSTRACT We assessed the quality of online health and nutrition information using a Google™ search on “supplements for cancer”. Search results were scored using the Health Information Quality Index (HIQI), a quality-rating tool consisting of 12 objective criteria related to website domain, lack of commercial aspects, and authoritative nature of the health and nutrition information provided. Possible scores ranged from 0 (lowest) to 12 (“perfect” or highest quality). After eliminating irrelevant results, the remaining 160 search results had median and mean scores of 8. One-quarter of the results were of high quality (score of 10–12). There was no correlation between high-quality scores and early appearance in the sequence of search results, where results are presumably more visible. Also, 496 advertisements, over twice the number of search results, appeared. We conclude that the Google™ search engine may have shortcomings when used to obtain information on dietary supplements and cancer.
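A toy sketch of how an index like the HIQI tallies a score (the criterion names below are placeholders, not the published instrument): each of the 12 objective yes/no criteria contributes one point, giving the 0 to 12 scale described above.

```python
# Placeholder criteria grouped by the three themes the abstract names:
# website domain, lack of commercial aspects, and authoritativeness.
CRITERIA = [
    "non_commercial_domain", "no_product_sales", "no_ads_on_page",
    "author_identified", "credentials_listed", "sources_cited",
    "date_published", "date_updated", "contact_information",
    "editorial_policy", "evidence_based_claims", "privacy_policy",
]

def quality_score(site: dict) -> int:
    """Sum the satisfied criteria; 0 = lowest quality, 12 = highest."""
    return sum(1 for c in CRITERIA if site.get(c, False))
```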


Electronics ◽  
2021 ◽  
Vol 10 (13) ◽  
pp. 1511
Author(s):  
Taylor Simons ◽  
Dah-Jye Lee

There has been a recent surge in publications related to binarized neural networks (BNNs), which use binary values to represent both the weights and activations in deep neural networks (DNNs). Due to the bitwise nature of BNNs, there have been many efforts to implement BNNs on ASICs and FPGAs. While BNNs are excellent candidates for these kinds of resource-limited systems, most implementations still require very large FPGAs or CPU-FPGA co-processing systems. Our work focuses on reducing the computational cost of BNNs even further, making them more efficient to implement on FPGAs. We target embedded visual inspection tasks, such as quality-inspection sorting of manufactured parts and agricultural produce. We propose a new binarized convolutional layer, called the neural jet features layer, that learns well-known classic computer vision kernels that are efficient to calculate as a group. We show that on visual inspection tasks, neural jet features perform comparably to standard BNN convolutional layers while using less computational resources. We also show that neural jet features tend to be more stable than BNN convolution layers when training small models.
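For intuition, a toy illustration (not the paper's layer): "jet"-style features are responses to a small bank of fixed, classic derivative kernels, and a binarized layer would combine those shared responses with learned ±1 weights, which is why computing the kernels as a group is cheap.

```python
import numpy as np
from scipy.signal import convolve2d

# A tiny bank of classic kernels standing in for the jet-feature bank.
KERNELS = {
    "sobel_x":   np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float),
    "sobel_y":   np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], float),
    "laplacian": np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], float),
}

def jet_responses(image: np.ndarray) -> dict:
    """Convolve one grayscale image with every kernel in the shared bank."""
    return {name: convolve2d(image, k, mode="same") for name, k in KERNELS.items()}

def binarized_combination(responses: dict, signs: dict) -> np.ndarray:
    """A BNN-style layer combines the shared responses with +/-1 weights."""
    return sum(signs[name] * r for name, r in responses.items())
```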


Author(s):  
Michael Nierla ◽  
Alexander Sutor ◽  
Stefan Johann Rupitsch ◽  
Manfred Kaltenbacher

Purpose: This paper aims to present a novel stageless evaluation scheme for a vector Preisach model that exploits rotational operators for the description of vector hysteresis. It is meant to resolve the discretization errors that arise from the standard matrix-based implementation of Preisach-based models.
Design/methodology/approach: The newly developed evaluation uses a nested-list data structure. Together with an adapted form of the Everett function, it makes it possible to represent both the additional rotational operator and the switching operator of the standard scalar Preisach model in a stageless fashion, i.e., without introducing discretization errors. Additionally, the presented updating and simplification rules ensure the computational efficiency of the scheme.
Findings: A comparison between the stageless evaluation scheme and the commonly used matrix approach reveals not only an improvement in accuracy up to machine precision but also a reduction in computational resources.
Research limitations/implications: The presented evaluation scheme is specifically designed for a vector Preisach model based on an additional rotational operator. A direct application to other vector Preisach models that do not rely on rotational operators is not intended. Nevertheless, the presented methodology allows an easy adaptation to similar vector Preisach schemes that use modified setting rules for the rotational operator and/or the switching operator.
Originality/value: Prior to this contribution, the vector Preisach model based on rotational operators could only be evaluated using a matrix-based approach that works with discretized forms of the rotational and switching operators. The presented evaluation scheme offers reduced computational cost at much higher accuracy. Therefore, it is of great interest to all users of the mentioned or similar vector Preisach models.
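For context, a minimal sketch of the conventional matrix-based evaluation the paper improves on (a toy scalar version, not the authors' vector scheme): the α ≥ β half-plane is discretized into relay hysterons, each flipped by the switching operator as the input crosses its thresholds, and the output is a weighted sum of relay states. The discretization error inherent in this grid is exactly what the stageless nested-list scheme avoids.

```python
import numpy as np

def make_grid(n, lo=-1.0, hi=1.0):
    """Discretize the Preisach half-plane alpha >= beta into n x n hysterons."""
    axis = np.linspace(lo, hi, n)
    alpha, beta = np.meshgrid(axis, axis, indexing="ij")
    return alpha, beta, alpha >= beta          # mask of valid hysterons

def step(state, alpha, beta, mask, u):
    """Switching operator: a relay flips to +1 when u >= alpha, to -1 when
    u <= beta, and otherwise holds its previous state."""
    state = state.copy()
    state[(u >= alpha) & mask] = 1.0
    state[(u <= beta) & mask] = -1.0
    return state

# Toy usage: start from negative saturation and drive with an input sequence.
alpha, beta, mask = make_grid(64)
weights = np.ones_like(alpha)                  # uniform Preisach density
state = -np.ones_like(alpha)
for u in [0.3, -0.1, 0.7, 0.0]:
    state = step(state, alpha, beta, mask, u)
    output = (weights * state)[mask].sum()     # scalar model output
```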


2018 ◽  
Vol 10 (4) ◽  
pp. 1
Author(s):  
Mileidy Alvarez-Melgarejo ◽  
Martha L. Torres-Barreto

The bibliometric method has proven to be a powerful tool for the analysis of scientific publications, in a way that allows rating the quality of the knowledge-generating process as well as its impact on the firm's environment. This article presents a comparison between two powerful bibliographic databases in terms of their coverage and the usefulness of their content. The comparison starts with a subject associated with the relationship between resources and capabilities. The outcomes show that the search results differ between the two databases. The Web of Science (WoS) has greater coverage than Scopus. It also has a greater impact in terms of most-cited authors and publications. The search results in the WoS yield articles from 2001 onward, while Scopus yields articles from 1976; however, some of the latter are inconsistent with the topic being searched. The analysis points to a lack of studies regarding resources as foundations of a firm's capabilities; as a result, new research in this field is suggested.


Author(s):  
Renata Marques de Oliveira ◽  
Alexandre Freitas Duarte ◽  
Domingos Alves ◽  
Antonia Regina Ferreira Furegato

ABSTRACT
Objective: to develop a mobile app for research on the use of tobacco among psychiatric patients and the general population.
Method: applied research with the technological development of an app for data collection on an Android tablet. For its development, we considered three criteria: data security, benefits for participants, and optimization of the researchers' time. We performed tests with twenty fictitious participants and a final test with six pilots.
Results: the app collects data, stores them in the tablet's database, and exports them to an Excel spreadsheet. Resources: calculator, stopwatch, offline operation, branching logic, field validation, and automatic tabulation.
Conclusion: the app prevents human error, increases the quality of the data by validating them during the interview, allows automatic tabulation, and makes the interviews less tiring. Its success may encourage the use of this and other computational resources by nurses as a research tool.


2018 ◽  
Author(s):  
Benjamin Brown-Steiner ◽  
Noelle E. Selin ◽  
Ronald Prinn ◽  
Simone Tilmes ◽  
Louisa Emmons ◽  
...  

Abstract. While state-of-the-art complex chemical mechanisms expand our understanding of atmospheric chemistry, their sheer size and computational requirements often limit simulations to short lengths, or ensembles to only a few members. Here we present and compare three 25-year offline simulations with chemical mechanisms of different levels of complexity using CESM Version 1.2 CAM-chem (CAM4): the MOZART-4 mechanism, the Reduced Hydrocarbon mechanism, and the Super-Fast mechanism. We show that, for most regions and time periods, the differences in simulated ozone chemistry between these three mechanisms are smaller than the model-observation differences themselves. The MOZART-4 and Reduced Hydrocarbon mechanisms are in close agreement in their representation of ozone throughout the troposphere during all time periods (annual, seasonal, and diurnal). While the Super-Fast mechanism tends to have higher simulated ozone variability and differs from the MOZART-4 mechanism over regions of high biogenic emissions, it is surprisingly capable of simulating ozone adequately given its simplicity. We explore the trade-offs between chemical mechanism complexity and computational cost by identifying regions where the simpler mechanisms are comparable to the MOZART-4 mechanism and regions where they are not. The Super-Fast mechanism is three times as fast as the MOZART-4 mechanism, which allows for longer simulations, or ensembles with more members, that may not be feasible with the MOZART-4 mechanism given limited computational resources.


2013 ◽  
Vol 378 ◽  
pp. 546-551 ◽  
Author(s):  
Joanna Strug ◽  
Barbara Strug

Mutation testing is an effective technique for assessing the quality of tests provided for a system. However, it suffers from the high computational cost of executing mutants of the system. In this paper, a method of classifying such mutants is proposed. The classification is based on an edit-distance kernel and a k-NN classifier. Using the results of this classification, it is possible to predict whether a mutant would be detected by the tests or not. The approach can thus help lower the number of mutants that have to be executed, and thereby the cost of mutation testing.
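A minimal sketch of the prediction idea under stated assumptions (a plain Levenshtein distance stands in for the paper's edit-distance kernel; names are illustrative): mutants textually similar to already-executed ones are assumed to share their fate, so a majority vote among the k nearest executed mutants predicts whether a new mutant would be killed, without running it.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two mutant sources."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def predict_killed(mutant: str, executed: list, k: int = 5) -> bool:
    """executed: list of (source_text, was_killed) for already-run mutants.
    Predict by majority vote of the k nearest neighbors in edit distance."""
    nearest = sorted(executed, key=lambda m: levenshtein(mutant, m[0]))[:k]
    votes = sum(1 for _, killed in nearest if killed)
    return votes * 2 > len(nearest)
```

Mutants predicted as killed (or as surviving, depending on the testing goal) can then be skipped, which is how the approach lowers the number of mutants that must actually be executed.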

