scholarly journals SneakySnake: a fast and accurate universal genome pre-alignment filter for CPUs, GPUs and FPGAs

Author(s):  
Mohammed Alser ◽  
Taha Shahroodi ◽  
Juan Gómez-Luna ◽  
Can Alkan ◽  
Onur Mutlu

Abstract Motivation We introduce SneakySnake, a highly parallel and highly accurate pre-alignment filter that remarkably reduces the need for computationally costly sequence alignment. The key idea of SneakySnake is to reduce the approximate string matching (ASM) problem to the single net routing (SNR) problem in VLSI chip layout. In the SNR problem, we are interested in finding the optimal path that connects two terminals with the least routing cost on a special grid layout that contains obstacles. The SneakySnake algorithm quickly solves the SNR problem and uses the found optimal path to decide whether or not performing sequence alignment is necessary. Reducing the ASM problem into SNR also makes SneakySnake efficient to implement on CPUs, GPUs and FPGAs. Results SneakySnake significantly improves the accuracy of pre-alignment filtering by up to four orders of magnitude compared to the state-of-the-art pre-alignment filters, Shouji, GateKeeper and SHD. For short sequences, SneakySnake accelerates Edlib (state-of-the-art implementation of Myers’s bit-vector algorithm) and Parasail (state-of-the-art sequence aligner with a configurable scoring function), by up to 37.7× and 43.9× (>12× on average), respectively, with its CPU implementation, and by up to 413× and 689× (>400× on average), respectively, with FPGA and GPU acceleration. For long sequences, the CPU implementation of SneakySnake accelerates Parasail and KSW2 (sequence aligner of minimap2) by up to 979× (276.9× on average) and 91.7× (31.7× on average), respectively. As SneakySnake does not replace sequence alignment, users can still obtain all capabilities (e.g. configurable scoring functions) of the aligner of their choice, unlike existing acceleration efforts that sacrifice some aligner capabilities. Availabilityand implementation https://github.com/CMU-SAFARI/SneakySnake. Supplementary information Supplementary data are available at Bioinformatics online.

2020 ◽  
Author(s):  
Baldomero Imbernón ◽  
Antonio Serrano ◽  
Andrés Bueno-Crespo ◽  
José L Abellán ◽  
Horacio Pérez-Sánchez ◽  
...  

Abstract Motivation Molecular docking methods are extensively used to predict the interaction between protein–ligand systems in terms of structure and binding affinity, through the optimization of a physics-based scoring function. However, the computational requirements of these simulations grow exponentially with: (i) the global optimization procedure, (ii) the number and degrees of freedom of molecular conformations generated and (iii) the mathematical complexity of the scoring function. Results In this work, we introduce a novel molecular docking method named METADOCK 2, which incorporates several novel features, such as (i) a ligand-dependent blind docking approach that exhaustively scans the whole protein surface to detect novel allosteric sites, (ii) an optimization method to enable the use of a wide branch of metaheuristics and (iii) a heterogeneous implementation based on multicore CPUs and multiple graphics processing units. Two representative scoring functions implemented in METADOCK 2 are extensively evaluated in terms of computational performance and accuracy using several benchmarks (such as the well-known DUD) against AutoDock 4.2 and AutoDock Vina. Results place METADOCK 2 as an efficient and accurate docking methodology able to deal with complex systems where computational demands are staggering and which outperforms both AutoDock Vina and AutoDock 4. Availability and implementation https://[email protected]/Baldoimbernon/metadock_2.git. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (24) ◽  
pp. 5243-5248 ◽  
Author(s):  
Ana S C. Silva ◽  
Robbin Bouwmeester ◽  
Lennart Martens ◽  
Sven Degroeve

Abstract Motivation The use of post-processing tools to maximize the information gained from a proteomics search engine is widely accepted and used by the community, with the most notable example being Percolator—a semi-supervised machine learning model which learns a new scoring function for a given dataset. The usage of such tools is however bound to the search engine’s scoring scheme, which doesn’t always make full use of the intensity information present in a spectrum. We aim to show how this tool can be applied in such a way that maximizes the use of spectrum intensity information by leveraging another machine learning-based tool, MS2PIP. MS2PIP predicts fragment ion peak intensities. Results We show how comparing predicted intensities to annotated experimental spectra by calculating direct similarity metrics provides enough information for a tool such as Percolator to accurately separate two classes of peptide-to-spectrum matches. This approach allows using more information out of the data (compared with simpler intensity based metrics, like peak counting or explained intensities summing) while maintaining control of statistics such as the false discovery rate. Availability and implementation All of the code is available online at https://github.com/compomics/ms2rescore. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Fergus Boyles ◽  
Charlotte M Deane ◽  
Garrett M Morris

Abstract Motivation Machine learning scoring functions for protein–ligand binding affinity prediction have been found to consistently outperform classical scoring functions. Structure-based scoring functions for universal affinity prediction typically use features describing interactions derived from the protein–ligand complex, with limited information about the chemical or topological properties of the ligand itself. Results We demonstrate that the performance of machine learning scoring functions are consistently improved by the inclusion of diverse ligand-based features. For example, a Random Forest (RF) combining the features of RF-Score v3 with RDKit molecular descriptors achieved Pearson correlation coefficients of up to 0.836, 0.780 and 0.821 on the PDBbind 2007, 2013 and 2016 core sets, respectively, compared to 0.790, 0.746 and 0.814 when using the features of RF-Score v3 alone. Excluding proteins and/or ligands that are similar to those in the test sets from the training set has a significant effect on scoring function performance, but does not remove the predictive power of ligand-based features. Furthermore a RF using only ligand-based features is predictive at a level similar to classical scoring functions and it appears to be predicting the mean binding affinity of a ligand for its protein targets. Availability and implementation Data and code to reproduce all the results are freely available at http://opig.stats.ox.ac.uk/resources. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 36 (12) ◽  
pp. 3739-3748
Author(s):  
Abhilesh S Dhawanjewar ◽  
Ankit A Roy ◽  
Mallur S Madhusudhan

Abstract Motivation The elucidation of all inter-protein interactions would significantly enhance our knowledge of cellular processes at a molecular level. Given the enormity of the problem, the expenses and limitations of experimental methods, it is imperative that this problem is tackled computationally. In silico predictions of protein interactions entail sampling different conformations of the purported complex and then scoring these to assess for interaction viability. In this study, we have devised a new scheme for scoring protein–protein interactions. Results Our method, PIZSA (Protein Interaction Z-Score Assessment), is a binary classification scheme for identification of native protein quaternary assemblies (binders/nonbinders) based on statistical potentials. The scoring scheme incorporates residue–residue contact preference on the interface with per residue-pair atomic contributions and accounts for clashes. PIZSA can accurately discriminate between native and non-native structural conformations from protein docking experiments and outperform other contact-based potential scoring functions. The method has been extensively benchmarked and is among the top 6 methods, outperforming 31 other statistical, physics based and machine learning scoring schemes. The PIZSA potentials can also distinguish crystallization artifacts from biological interactions. Availability and implementation PIZSA is implemented as a web server at http://cospi.iiserpune.ac.in/pizsa and can be downloaded as a standalone package from http://cospi.iiserpune.ac.in/pizsa/Download/Download.html. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (21) ◽  
pp. 4255-4263 ◽  
Author(s):  
Mohammed Alser ◽  
Hasan Hassan ◽  
Akash Kumar ◽  
Onur Mutlu ◽  
Can Alkan

AbstractMotivationThe ability to generate massive amounts of sequencing data continues to overwhelm the processing capability of existing algorithms and compute infrastructures. In this work, we explore the use of hardware/software co-design and hardware acceleration to significantly reduce the execution time of short sequence alignment, a crucial step in analyzing sequenced genomes. We introduce Shouji, a highly parallel and accurate pre-alignment filter that remarkably reduces the need for computationally-costly dynamic programming algorithms. The first key idea of our proposed pre-alignment filter is to provide high filtering accuracy by correctly detecting all common subsequences shared between two given sequences. The second key idea is to design a hardware accelerator that adopts modern field-programmable gate array (FPGA) architectures to further boost the performance of our algorithm.ResultsShouji significantly improves the accuracy of pre-alignment filtering by up to two orders of magnitude compared to the state-of-the-art pre-alignment filters, GateKeeper and SHD. Our FPGA-based accelerator is up to three orders of magnitude faster than the equivalent CPU implementation of Shouji. Using a single FPGA chip, we benchmark the benefits of integrating Shouji with five state-of-the-art sequence aligners, designed for different computing platforms. The addition of Shouji as a pre-alignment step reduces the execution time of the five state-of-the-art sequence aligners by up to 18.8×. Shouji can be adapted for any bioinformatics pipeline that performs sequence alignment for verification. Unlike most existing methods that aim to accelerate sequence alignment, Shouji does not sacrifice any of the aligner capabilities, as it does not modify or replace the alignment step.Availability and implementationhttps://github.com/CMU-SAFARI/Shouji.Supplementary informationSupplementary data are available at Bioinformatics online.


2014 ◽  
Vol 51 ◽  
pp. 413-441 ◽  
Author(s):  
S. Cai ◽  
C. Luo ◽  
K. Su

It is widely acknowledged that stochastic local search (SLS) algorithms can efficiently find models for satisfiable instances of the satisfiability (SAT) problem, especially for random k-SAT instances. However, compared to random 3-SAT instances where SLS algorithms have shown great success, random k-SAT instances with long clauses remain very difficult. Recently, the notion of second level score, denoted as "score_2", was proposed for improving SLS algorithms on long-clause SAT instances, and was first used in the powerful CCASat solver as a tie breaker. In this paper, we propose three new scoring functions based on score_2. Despite their simplicity, these functions are very effective for solving random k-SAT with long clauses. The first function combines score and score_2, and the second one additionally integrates the diversification property "age". These two functions are used in developing a new SLS algorithm called CScoreSAT. Experimental results on large random 5-SAT and 7-SAT instances near phase transition show that CScoreSAT significantly outperforms previous SLS solvers. However, CScoreSAT cannot rival its competitors on random k-SAT instances at phase transition. We improve CScoreSAT for such instances by another scoring function which combines score_2 with age. The resulting algorithm HScoreSAT exhibits state-of-the-art performance on random k-SAT (k>3) instances at phase transition. We also study the computation of score_2, including its implementation and computational complexity.


Author(s):  
Jun Pei ◽  
Zheng Zheng ◽  
Hyunji Kim ◽  
Lin Song ◽  
Sarah Walworth ◽  
...  

An accurate scoring function is expected to correctly select the most stable structure from a set of pose candidates. One can hypothesize that a scoring function’s ability to identify the most stable structure might be improved by emphasizing the most relevant atom pairwise interactions. However, it is hard to evaluate the relevant importance for each atom pair using traditional means. With the introduction of machine learning methods, it has become possible to determine the relative importance for each atom pair present in a scoring function. In this work, we use the Random Forest (RF) method to refine a pair potential developed by our laboratory (GARF6) by identifying relevant atom pairs that optimize the performance of the potential on our given task. Our goal is to construct a machine learning (ML) model that can accurately differentiate the native ligand binding pose from candidate poses using a potential refined by RF optimization. We successfully constructed RF models on an unbalanced data set with the ‘comparison’ concept and, the resultant RF models were tested on CASF-2013.5 In a comparison of the performance of our RF models against 29 scoring functions, we found our models outperformed the other scoring functions in predicting the native pose. In addition, we used two artificial designed potential models to address the importance of the GARF potential in the RF models: (1) a scrambled probability function set, which was obtained by mixing up atom pairs and probability functions in GARF, and (2) a uniform probability function set, which share the same peak positions with GARF but have fixed peak heights. The results of accuracy comparison from RF models based on the scrambled, uniform, and original GARF potential clearly showed that the peak positions in the GARF potential are important while the well depths are not. <br>


2021 ◽  
Vol 5 (EICS) ◽  
pp. 1-18
Author(s):  
Hae-Na Lee ◽  
Vikas Ashok ◽  
IV Ramakrishnan

Many people with low vision rely on screen-magnifier assistive technology to interact with productivity applications such as word processors, spreadsheets, and presentation software. Despite the importance of these applications, little is known about their usability with respect to low-vision screen-magnifier users. To fill this knowledge gap, we conducted a usability study with 10 low-vision participants having different eye conditions. In this study, we observed that most usability issues were predominantly due to high spatial separation between main edit area and command ribbons on the screen, as well as the wide span grid-layout of command ribbons; these two GUI aspects did not gel with the screen-magnifier interface due to lack of instantaneous WYSIWYG (What You See Is What You Get) feedback after applying commands, given that the participants could only view a portion of the screen at any time. Informed by the study findings, we developed MagPro, an augmentation to productivity applications, which significantly improves usability by not only bringing application commands as close as possible to the user's current viewport focus, but also enabling easy and straightforward exploration of these commands using simple mouse actions. A user study with nine participants revealed that MagPro significantly reduced the time and workload to do routine command-access tasks, compared to using the state-of-the-art screen magnifier.


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Xujun Zhang ◽  
Chao Shen ◽  
Xueying Guo ◽  
Zhe Wang ◽  
Gaoqi Weng ◽  
...  

AbstractVirtual screening (VS) based on molecular docking has emerged as one of the mainstream technologies of drug discovery due to its low cost and high efficiency. However, the scoring functions (SFs) implemented in most docking programs are not always accurate enough and how to improve their prediction accuracy is still a big challenge. Here, we propose an integrated platform called ASFP, a web server for the development of customized SFs for structure-based VS. There are three main modules in ASFP: (1) the descriptor generation module that can generate up to 3437 descriptors for the modelling of protein–ligand interactions; (2) the AI-based SF construction module that can establish target-specific SFs based on the pre-generated descriptors through three machine learning (ML) techniques; (3) the online prediction module that provides some well-constructed target-specific SFs for VS and an additional generic SF for binding affinity prediction. Our methodology has been validated on several benchmark datasets. The target-specific SFs can achieve an average ROC AUC of 0.973 towards 32 targets and the generic SF can achieve the Pearson correlation coefficient of 0.81 on the PDBbind version 2016 core set. To sum up, the ASFP server is a powerful tool for structure-based VS.


Author(s):  
Matteo Chiara ◽  
Federico Zambelli ◽  
Marco Antonio Tangaro ◽  
Pietro Mandreoli ◽  
David S Horner ◽  
...  

Abstract Summary While over 200 000 genomic sequences are currently available through dedicated repositories, ad hoc methods for the functional annotation of SARS-CoV-2 genomes do not harness all currently available resources for the annotation of functionally relevant genomic sites. Here, we present CorGAT, a novel tool for the functional annotation of SARS-CoV-2 genomic variants. By comparisons with other state of the art methods we demonstrate that, by providing a more comprehensive and rich annotation, our method can facilitate the identification of evolutionary patterns in the genome of SARS-CoV-2. Availabilityand implementation Galaxy   http://corgat.cloud.ba.infn.it/galaxy; software: https://github.com/matteo14c/CorGAT/tree/Revision_V1; docker: https://hub.docker.com/r/laniakeacloud/galaxy_corgat. Supplementary information Supplementary data are available at Bioinformatics online.


Sign in / Sign up

Export Citation Format

Share Document